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ADVANCED  NETWORKING  AND  DISTRIBUTED  SYSTEMS 

Defense  Advanced  Research  Projects  Agency 
Annual  Technical  Report 


May  30, 1993 


This  Yearly  Technical  Report  covers  research  carried  out  on  the  Advanced  Networking  and  Dis¬ 
tributed  Systems  Contract  at  UCLA  under  DARPA  Contract  Number  MDA  972-91-J-101 1  cov¬ 
ering  the  period  from  June  1,  1991  through  May  30,  1992.  Under  this  contract  we  have  the  fol¬ 
lowing  statement  of  work  comprising  five  tasks: 


STATEMENT  OF  WORK 
Topic  A:  High  Speed  Networking 

Task  A1 :  Fast  Packet  Switching  Using  Multistage  Interconnection  Networks 

We  propose  to  investigate  the  performance  of  a  variety  of  Multistage  Interconnection  Networks 
such  as  the  Starlite  network.  We  will  develop  analytical  models  to  evaluate  the  throughput  and 
response  time  of  the  overall  traffic  in  the  case  of  uniform  traffic  as  well  as  certain  forms  of  hot 
spot  traffic.  We  will  also  evaluate  the  behavior  of  Message  Combining  to  eliminate  the  effects  of 
hot  spots.  A  transformation  and  superposition  method  is  being  developed  to  be  used  with  the 
analytical  model  to  evaluate  any  given  general  traffic  pattern  (e.g.,  multiple  hot  spots).  A  delay 
model  analysis  comparing  the  discarding  switch  and  the  blocking  switch  will  also  be  developed. 
We  also  propose  to  study  a  structured  buffered  pool  scheme  to  prevent  normal  traffic  from  being 
blocked  by  the  saturated  tree  caused  by  hot  spot  traffic. 

Task  A2:  Analysis  Of  Competing  Lightwave  Networks 

The  use  of  Wavelength  Division  Multiple  Access  (WDMA)  optical  switching  for  high-speed 
packet  networks  is  a  predictable  development  in  the  evolution  of  fast  packet  switching.  We  pro¬ 
pose  to  evaluate  the  behavior  of  single-hop  WDMA  optical  switching,  using  agile  receiver 
filters.  Whereas  our  main  thrust  will  be  on  these  single-hop  structures,  we  will  also  look  at 
multi-hop  access  using  fixed  filters.  We  will  compare  the  response  time,  blocking  and 
throughput  for  each. 


The  views  and  conclusions  contained  in  this  document  are  those  of  the  authors  and  should  not  be 
interpreted  as  necessarily  representing  the  official  policies,  either  expressed  or  implied,  of  the 
Defense  Advanced  Research  Projects  Agency  or  the  United  States  Government. 
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TOPIC  B:  ARCHITECTURE  AND  PARALLEL  PROCESSING  / 

Task  B1 :  P erformance  Of  Boolean  n-Cube  Interconnection  Networks 

We  propose  to  evaluate  the  performance  of  Boolean  n-cube  interconnection  networks  for  paral¬ 
lel  processing  systems.  The  focus  will  be  on  data  communication  issues  rather  than  on  process¬ 
ing  issues.  By  exploiting  the  homogeneity  property  of  Boolean  n-cube  interconnection  networks, 
we  can  design  non-blocking  routing  algorithms  with  limited  size  buffers.  A  technique  called  re¬ 
ferral  is  used  to  guarantee  that  eveiy  node  accepts  all  the  messages  transmitted  from  its  neigh¬ 
bors.  This  type  of  routing  algorithm  is  critical  in  any  implementation.  Store-and-forward  is  one 
such  routing  algorithm.  In  this  scheme,  time  is  divided  into  cycles  to  which  the  network  is  syn¬ 
chronized.  In  each  cycle  every  node  simultaneously  transmits  some  of  its  stored  messages  to  its 
neighbors.  An  analytical  model  will  be  developed  to  predict  the  network  performance  under  dif¬ 
ferent  traffic  patterns.  We  also  intend  to  design  an  intelligent  routing  algorithm  to  improve  the 
performance.  Another  routing  scheme  to  consider  is  a  modified  version  of  virtual  cut-through. 
Virtual  cut-through  is  a  scheme  such  that  when  a  message  arrives  at  an  intermediate  node  and  its 
selected  outgoing  channel  is  free,  then  the  message  is  sent  to  the  adjacent  node  before  it  is  com¬ 
pletely  received  at  this  intermediate  node.  Therefore,  the  delay  due  to  unnecessary  buffering  in 
front  of  an  idle  channel  is  avoided.  Modified  virtual  cut-through  is  also  a  non-blocking  algo¬ 
rithm.  We  will  investigate  the  (positive  or  negative)  effect  of  adding  additional  buffers  to  a  node 
m  this  case.  We  are  further  interested  in  non-uniform  traffic  problems  in  Boolean  n-cube  net¬ 
works. 


We  also  propose  to  study  the  performance  of  these  networks  in  a  hostile  and/or  unreliable  en¬ 
vironment.  In  this  environment,  nodes  and  links  may  disappear  and  also  unreliable  (i.e.,  noisy) 
transmissions  may  occur. 

Task  B2:  Distributed  Simulation 

Parallel  asynchronous  simulation  methods  (such  as  Time  Warp)  offer  an  optimistic  alternative  to 
synchronous  conservative  approaches  to  distributed  simulation.  We  propose  to  evaluate  the 
speedup  of  P  processors  conducting  a  parallel  asynchronous  simulation  using  analytic  and  simu¬ 
lation  tools.  We  already  have  an  exact  solution  for  the  case  of  two  processors  (P=2).  Also,  we 
have  upper  bounds  on  the  best  one  can  do  by  letting  the  P  processors  run  ahead  of  each  other  as 
compared  to  forcing  them  to  synchronize  at  every  step.  We  are  interested  in  extending  the  results 
to  P  processors  and  to  include  the  effect  of  queued  messages.  Furthermore,  we  propose  to  inves¬ 
tigate  the  use  of  the  linear  Poisson  process  as  a  model  for  these  systems. 

Task  B3:  A  New  Model  Of  Load  Sharing 

We  are  interested  in  studying  the  behavior  of  interacting  processes  which  gobble  up  processing 
resources  in  their  neighborhood.  In  particular,  if  we  begin  with  a  one-dimensional  world,  we  can 
place  processes  on  a  ring,  where  there  is  a  quantity  of  processing  power  distributed  uniformly 
around  the  ring.  A  process  requires  a  changing  amount  of  processing  capacity.  As  its  needs  in¬ 
crease,  the  process  attempts  to  grow  in  both  directions  along  the  ring  until  it  either  has  enough 
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capacity,  or  it  bumps  into  another  process  moving  in  its  direction,  in  which  case  they  both  stop 
moving  toward  each  other.  As  time  progresses,  a  process  may  or  may  not  have  all  the  capacity  it 
needs.  The  object  is  to  study  the  response  time  of  jobs  represented  by  such  processes  in  a  limit¬ 
ed  resource,  competitive  environment.  Clearly,  this  model  extends  to  higher  dimensions,  and  we 
propose  to  study  the  case  where  processors  are  distributed  over  a  multi-dimensional  hypersphere. 
1  he  effect  of  distributed  load  sharing  in  this  environment  will  be  evaluated. 


Accomplishments  for  this  Period 

During  we  have  graduated  1  Ph.D.  student,  have  had  9  papers  published,  2  papers  accepted  and  2 

papers  submitted  to  the  professional  literature.  Progress  continues  to  be  made  in  our  stated 
goals. 

We  have  encountered  no  obstacles  to  our  research  and  have  made  some  major  contributions  in 
new  areas  as  well  as  in  extensions  of  previous  work  on  this  grant. 

We  have  developed  some  important  results  in  the  area  of  parallel  processing  systems.  In  particu¬ 
lar,  we  established  a  metric  which  allows  us  to  model  the  behavior  of  programs  in  a  way  which 
identifies  the  optimum  number  of  processors  one  should  assign  to  a  job.  This  has  led  to  a  gen¬ 
eralization  of  Amdahl’s  Law  as  well  as  a  simple  rule  that  specifies  the  optimum  number  of  pro¬ 
cessors.  One  paper,  "On  Parallel  Processing  Systems:  Amdahl’s  Law  Generalized  and  Some 
Results  on  Optimal  Design"  by  Kleinrock  and  Huang,  which  is  attached  to  this  report,  addresses 
the  optimization  problem  in  the  case  where  any  processors  that  go  idle  in  the  course  of  the  com¬ 
putation  are  not  allowed  to  be  temporarily  reassigned  to  other  jobs  waiting  in  the  queue.  In 
another  paper  (also  attached),  "Performance  Evaluation  of  Dynamic  Sharing  of  Processors  in 
Two-Stage  Parallel  Processing  Systems"  by  Huang  and  Kleinrock  does  allow  this  reuse  of  idle 
processors  by  waiting  jobs. 


We  had  earlier  reported  on  our  results  for  Time-Warp  (optimistic)  Simulation  methods.  A  relat¬ 
ed  study,  represented  by  the  attached  paper  entitled  "The  Virtual  Time  Data-Parallel  Machine" 
by  Shen  and  Kleinrock,  evaluated  the  performance  of  aggressive  processing  of  ready  tasks  at  the 
instruction  level.  This  approach  shows  significant  increases  in  performance  over  the  synchron¬ 
ized  approach  to  instruction  execution. 

A  universa1  result  was  established  in  the  attached  paper,  "Depth-  First  Heuristic  Search  on  a 
SIMD  Machine"  by  Powley,  Ferguson  and  Korf.  This  result  determined  the  optimum  times 
when  a  SIMD  processing  system  should  do  load  balancing. 

In  many  of  our  investigations,  we  continue  to  encounter  the  issue  of  how  to  use  processing 
power  that  has  temporarily  become  available.  We  address  the  gains  to  be  had  in  this  case  direct¬ 
ly  in  the  attached  paper,  "Collecting  Unused  Processing  Capacity:  An  Analysis  of  Transient  Dis¬ 
tributed  Systems"  by  Kleinrock  and  Korfhage.  We  determine  the  response  time  of  a  job  when  it 
has  available  a  randomly  changing  number  of  processors  that  cooperate  in  its  execution. 
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Our  work  in  high  speed  networking  with  fiber  optics  continues  to  produce  valuable  results. 
Indeed,  we  have  established  the  performance  limits  of  an  ideal  fiber-optic  wave-length-division 
switch  as  a  function  of  the  number  of  tunable  and  fixed  transmitters  and  receivers.  This  work  is 
reported  in  the  attached  paper,  "Performance  Analysis  of  Single-Hop  Wavelength  Division  Mul¬ 
tiple  Access  Networks"  by  Lu  ad  Kleinrock. 

Below  we  list  our  publications  for  this  period.  Progress  continues  along  a  number  of  fronts,  and 
we  are  beginning  to  study  the  performance  of  multi-hop,  multi-channel  wireless  communication 
systems  in  a  mobile  environment,  as  well  as  adaptive  and  self-regulating  automata  in  a  loosely 
coupled  environment. 
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Huang,  J.H.  and  L.  Kleinrock,  "Performance  Evaluation  of  Dynamic  Sharing  of  Processors  in 
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Kleinrock,  L.  and  W.  Korfhage,  "Collecting  Unused  Processing  Capacity:  An  Analysis  of  Tran¬ 
sient  Distributed  Systems",  IEEE  TPDS,  May  1993,  pp.  535,546. 

PAPERS  ACCEPTED  FOR  PUBLICATION 

1.  Tung,  Brian  and  L.  Kleinrock,  "Distributed  Control  Methods"  accepted  to  2nd  Interna¬ 
tional  Symposium  on  High  Performance  Distributed  Computing,  Spokane  Washington, 
July  21,23, 1993. 

2.  Felderman,  R.  and  L.  Kleinrock,  "Two  Processor  Time  Warp  Analysis:  Capturing  the 
Effects  of  Message  Queueing  and  Rollback/State  Saving  Costs,"  UCLA  Technical  Re¬ 
port  CSD-920035,  accepted  for  Memorial  Issue  for  Felix  Pollaczek.  AEU  special  issue 
"Teletraffic  Theory  and  Engineering". 

PAPERS  SUBMITTED  FOR  PUBLICATION 

1.  Lu,  J.  C.  and  Kleinrock,  L.,  "A  WDMA  Protocol  for  Multichannel  DQDB  Networks" 
submitted  to  GLOBECOM  ’93,  January  1993 

2.  Tung,  Brian  and  L.  Kleinrock  "Distributed  Control  Using  Simple  Automata"  submitted  to 
IEEE  Transactions  on  Parallel  and  Distributed  Systems,  December  1992 

PAPERS  IN  PREPARATION 

1.  Kleinrock,  L.  and  D.  Nielsen,  "Data  Structures  and  Algorithms  for  the  Efficient  Imple¬ 
mentation  of  Structural  Level  Reduction  of  the  GSPN  Model",  to  be  submitted  to  Inter¬ 
national  Workshop  on  Petri  Nets  and  Performance  Models. 

2.  Kleinrock,  L.  and  D.  Nielsen,  "Analysis  Techniques  to  Increase  the  Computational 
Power  of  the  GSPN-Reward  Model",  to  be  submitted  to  International  Workshop  on  Petri 
Nets  and  Performance  Models. 


THE  VIRTUAL-TIME  DATA-PARALLEL  MACHINE 


Shioupyn  Shen 
Leonard  Kleinrock 


Reprinted  from  PROCEEDINGS  OF  THE  FOURTH  SYMPOSIUM  ON  THE 
FRONTIERS  OF  MASSIVELY  PARALLEL  COMPUTATION  (FRONTIERS  ’92), 
McLean,  Virginia,  October  19-21,  1992 


Wash  motors,  DC 


Los  Alamitos  c  Brussels  ©  Tokyo 


Pfe'i  THE  INSTITUTE  OF  ELECTRICAL  AND  ELECTRONICS  ENGINEERS,  INC. 


&  \ 

,  l\h  IEEE  COMPUTER  SOCIETY 


IEEE 


The  Virtual-Time  Data-Parallel  Machine 

Shioupyn  Shen  and  Leonard  Kleinrock 


Computer  Science  Department 
University  of  California,  Los  Angeles 
Los  Angeles,  CA  90024-1596 


Abstract 

We  propose  the  “Virtual- Time  Data-Parallel  Ma¬ 
chine”  to  execute  SIMD  (Single  Instruction  Multiple 
Data)  programs  asynchronously.  We  first  illustrate 
how  asynchronous  execution  is  more  efficient  than  syn¬ 
chronous  execution.  For  a  simple  model,  we  show  that 
asynchronous  execution  outperforms  synchronous  ex¬ 
ecution  roughly  by  a  factor  of  (In  TV),  where  N  is  the 
number  of  processors  in  the  system.  We  then  explore 
how  to  execute  SIMD  programs  asynchronously  with¬ 
out  violating  the  SIMD  semantics.  We  design  a  FIFO 
priority  cache ,  one  for  each  processing  element,  to 
record  the  recent  history  of  all  variables.  The  cache, 
which  is  stacked  between  the  processor  and  the  mem¬ 
ory,  supports  asynchronous  execution  in  hardware  effi¬ 
ciently  and  preserves  the  SIMD  semantics  of  the  soft¬ 
ware  transparently.  Analysis  and  simulation  results 
indicate  that  the  Virtual-Time  Data-Parallel  Machine 
can  achieve  linear  speed-up  for  computation  intensive 
data-parallel  programs  when  the  number  of  processors 
is  large. 

1  Introduction 

For  the  past  twenty  years,  solid  state  technology  has 
been  much  more  successful  in  reducing  the  cost  of 
VLSI  chips  than  in  increasing  the  peak  speed  of  ECL 
circuits.  As  a  direct  result,  the  architectural  superi¬ 
ority  of  supercomputers  is  vanishing  because  we  can 
easily  implement  most  of  the  advanced  features  of  su¬ 
percomputers  on  a  single  chip1.  The  performance  gap 
between  the  fastest  processor  (in  terms  of  mips)  and 
the  most  cost-effective  processor  (in  terms  of  mips/$) 
is  diminishing  rapidly.  In  the  future,  the  key  to  su¬ 
percomputing  will  not  be  the  high  speed  of  a  single 
processor;  instead,  it  will  be  the  high  degree  of  paral¬ 
lelism. 

The  difficulties  of  parallel  processing  are  two-fold. 
The  first  problem  is  that  the  computational  model  is 
hard  to  use  (for  asynchronous  execution)  and  the  sec¬ 
ond  problem  is  that  the  hardware  efficiency  is  poor  (for 
synchronous  execution).  We  propose  the  “Virtual- 
Time  Data-Parallel  Machine”  to  solve  both  problems 


'There  are  approximately  three  million  transistors  in  an  Intel 
80586  microprocessor  but  only  two  million  transistors  in  a  CRAY-1 
supercomputer. 


at  once.  The  concept  of  this  machine  is  derived  from 
“The  Connection  Machine”  [4]  and  “Virtual  Time”  [6], 

The  Connection  Machine  introduced  the  data- 
parallel  computational  model  [5],  The  SIMD  seman¬ 
tics  of  the  data-parallel  model  make  it  easy  to  develop 
parallel  programs  and  make  it  capable  of  expressing 
fine-grain  parallelism.  Though  the  Connection  Ma¬ 
chine  achieves  significant  speed-up  for  a  large  number 
of  processors,  hardware  efficiency  may  be  poor  because 
of  its  requirement  for  synchronous  execution. 

Virtual  Time  introduced  the  “Time  Warp”  synchro¬ 
nization  mechanism  for  parallel  discrete  event  simu¬ 
lation  [7],  The  optimistic  approach  of  Time  Warp 
eliminates  unnecessary  blocking,  and  therefore  makes 
better  use  of  the  hardware.  However,  it  is  hard  to 
generalize  Virtual  Time  to  other  paradigms  of  parallel 
processing. 

We  suggest  the  use  of  Time  Warp  to  execute  data- 
parallel  programs  asynchronously  in  hopes  of  exploit¬ 
ing  more  parallelism  and  obtaining  better  efficiency. 
By  performance  modeling,  we  show  that  the  efficiency 
(i.e.,  the  sustained  speed  over  the  raw  speed)  of  the 
system  is  asymptotically  40%  for  a  large  number  of 
processors;  this  is  a  significant  improvement  over  the 
traditional  approach. 

The  organization  of  this  paper  is  as  follows:  Sec¬ 
tion  2  provides  the  motivation,  Section  3  explores  the 
key  concept,  Section  4  addresses  performance  model¬ 
ing,  Section  5  describes  the  hardware  support,  Sec¬ 
tion  6  discusses  the  extensions,  and  the  last  section 
concludes  the  paper. 

2  Motivation  -  The  Interconnection 
Network  Bottleneck 

The  data-parallel  approach  has  been  very  successful  in 
solving  or  avoiding  many  of  the  technical  difficulties 
of  parallel  processing. 

Data-parallel  computers  would  be  the  obvious 
champion  in  the  parallel  processing  arena  if  the  in¬ 
terconnection  network  were  not  the  bottleneck. 

The  performance  of  the  data-parallel  approach  is 
more  sensitive  to  the  interconnection  network  than 
that  of  the  other  approaches  because  of  its  SIMD  se¬ 
mantics.  Even  though  all  processors  start  executing 
the  same  instruction  simultaneously,  they  seldom  fin¬ 
ish  this  instruction  together  for  many  reasons.  For 
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Figure  1:  A  hypothetical  example  of  remote  access 
time  distribution. 

example,  the  access  of  remote  operands  (i.e.,  variables 
on  other  processing  elements)  may  take  vastly  different 
amounts  of  time  due  to  network  contention  and  block¬ 
ing.  As  a  result,  the  execution  time  of  an  instruction 
varies  among  the  processors. 

Figure  1  shows  a  hypothetical  example  of  the  re¬ 
mote  access  time  distribution,  where  f(t)  and  F(t)  are 
the  probability  density  function  and  probability  distri¬ 
bution  function  of  the  remote  access  time,  respectively. 
Though  the  remote  access  time  distributions  for  vari¬ 
ous  interconnection  networks  are  different,  they  have 
several  characteristics  in  common  -  they  have  large 
mean  and  variance,  and  more  importantly,  their  prob¬ 
ability  density  functions  have  long,  tiny  tails.  The 
long  tiny  tail  has  little  influence  on  the  mean  remote 
access  time  because  it  is  so  tiny.  However,  it  is  the 
long  (though  tiny)  tail  that  drives  the  performance 
down. 

The  synchronous  execution  of  SIMD  programs 
forces  a  processor  which  finishes  the  instruction  early 
wait  until  all  processors  finish  this  instruction.  There¬ 
fore,  what  really  counts  is  the  longest  execution  time 
(in  other  words,  the  worst  case  in  remote  access  time) 
across  all  processors.  The  maximum  value  of  N  inde¬ 
pendent  samples  is  approximately  F_1(l  —  ^),  where 
F~l(t)  is  the  inverse  function  of  F(t).  For  large  N, 
the  above  term  is  determined  by  the  long,  tiny  tail  of 
f(t)  as  shown  in  Fig.  2. 

In  our  experience  and  that  of  the  others,  the  critical 
bottleneck  of  data-parallel  computing  is  the  intercon¬ 
nection  network.  Even  though  the  bandwidth  of  the 
interconnection  network  is  large,  we  cannot  justify  the 
cost  of  providing  sufficient  bandwidth  to  reduce  the 
maximum  remote  access  time  for  random  communica¬ 
tion  patterns.  We  would  like  to  know  how  good  the 
performance  will  be  if  we  smooth  out  the  variation  of 
the  remote  access  time.  If  the  performance  is  very 
good,  we  also  would  like  to  figure  out  how  to  do  it  at 
acceptable  cost. 


Figure  2:  The  longest  remote  access  time  for  N  pro¬ 
cessors  is  F_1(l  -  ^),  which  is  mainly  determined  by 
the  long  tiny  tail  of  the  probability  density  function 
/(*)■ 


First,  we  show  that  the  system  can  achieve  linear 
speed-up  (constant  efficiency)  for  a  large  number  of 
processors.  Second,  we  propose  a  very  cost-effective 
solution  (asynchronous  execution)  to  reduce  the  sus¬ 
ceptibility  to  the  variation  of  remote  access  time  with¬ 
out  modifying  the  interconnection  network.  Our  ap¬ 
proach  is  to  attach  minimal  hardware  support  to  every 
processing  element  to  “neutralize”  the  network  haz¬ 
ards,  instead  of  resorting  to  an  expensive  “upgrade”  of 
the  interconnection  network.  We  can  achieve  roughly 
the  same  performance  as  if  the  remote  access  time 
is  constant  but  with  twice  the  mean.  This  approach 
trades  the  variance  for  a  larger  mean.  The  above 
trade-off  is  favorable  because  the  major  bottleneck  is 
in  the  variance  instead  of  the  mean,  especially  for  sys¬ 
tems  with  a  large  number  of  processors. 

In  addition  to  the  remote  operand  fetch,  the  actual 
computation  of  the  instruction  sometimes  introduces 
large  variations  into  the  instruction  execution  time 
as  well.  Conditional  enabling/ disabling  is  a  common 
practice  in  data-parallel  programming.  Even  though 
it  takes  constant  amount  of  time  for  the  enabled  pro¬ 
cessors  to  execute  the  instruction,  collectively  speak¬ 
ing,  the  instruction  execution  time  varies  because  it 
takes  no  time  for  the  disabled  processors  to  skip  the 
instruction.  If  the  disabling  probability  is  high2,  then 
the  execution  time  varies  a  lot.  In  this  paper,  we  use 
the  generic  term  “instruction  execution  time” ,  which 
may  refer  to  either  the  remote  access  time  or  the  com¬ 
putation  time  or  both. 


2Fbr  example,  a  tree-reduction  operation  [5]  of  size  N  takes 
(log2  N)  iterations  for  a  total  of  ( N  -  1)  operations.  The  disabling 
probability  is  as  high  as  (1  -  iog^ //)♦ 
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3  The  Key  Concept  —  Asynchronous 
SIMD 

The  Connection  Machine  is  a  typical  example  of  the 
traditional  data-parallel  machine,  which  has  the  fol¬ 
lowing  characteristics  -  (i)  SIMD,  (ii)  distributed 
memory,  (iii)  massive  parallelism,  and  (iv)  pro¬ 
grammable  interconnection.  We  are  perfectly  happy 
with  these  properties  except  the  first  one  -  SIMD, 
or  more  precisely,  synchronous  execution  of  SIMD 
programs.  The  inefficiency  of  synchronous  execution 
comes  from  unnecessary  blocking.  Processors  that 
finish  the  current  instruction  early  are  blocked  un¬ 
til  all  processors  finish  this  instruction,  even  though 
the  operands  needed  to  execute  the  next  instruction 
may  be  available.  As  we  know,  synchronous  execu¬ 
tion  is  a  direct  way  to  enforce  the  “SIMD  semantics”, 
but  what  really  matters  is  the  SIMD  semantics  itself, 
rather  than  the  synchronous  execution. 

SIMD  semantics  is  in  fact  a  kind  of  causality  con¬ 
straint,  which  is  explained  as  follows.  The  execu¬ 
tion  of  the  i-th  instruction  (an  event  “scheduled”  at 
“simulation”  time  i)  depends  on  the  execution  of  the 
(i-l)-th  instruction  (an  event  scheduled  at  simula¬ 
tion  time  i-1).  The  sequence  count  of  the  instruc¬ 
tion  stream  is  analogous  to  the  simulation  time,  which 
specifies  when  an  event  should  happen,  in  contrast  to 
“real-time”  when  the  event  does  happen.  The  data- 
dependency  constraint  of  the  SIMD  semantics  is  thus 
equivalent  to  the  causality  constraint  of  parallel  dis¬ 
crete  event  simulation  (PDES). 

The  synchronous  execution  of  SIMD  programs  is  es¬ 
sentially  the  time-stepped  execution  of  PDES,  which 
is  considered  an  inefficient  implementation  of  PDES. 
On  the  other  hand,  the  optimistic  approach  of  Vir¬ 
tual  Time  employs  periodic  state-saving  so  that  pro¬ 
cessors  have  more  freedom  to  go  ahead  instead  of  be¬ 
ing  blocked  unnecessarily.  The  Virtual-Time  Data- 
Parallel  Machine  takes  a  similar  (but  not  identical) 
approach  -  the  execution  of  the  next  instruction  can 
proceed  independently  of  the  progress  on  other  proces¬ 
sors  as  long  as  its  own  data  dependencies  are  satisfied 
and  its  current  state  is  properly  saved. 

Figure  3  illustrates  how  asynchronous  execution  of 
SIMD  programs  is  far  more  efficient  than  the  con¬ 
ventional  synchronous  execution.  In  a  task  graph, 
nodes  correspond  to  tasks  (instructions)  and  links  Cor¬ 
respond  to  causality  constraints  (data  dependencies). 
Figure  3.a  shows  the  intrinsic  data  dependencies  of  an 
example  program,  which  ignores  all  artifacts  due  to 
the  execution  model.  When  synchronous  execution  is 
enforced,  it  is  equivalent  to  adding  more  links  to  the 
task  graph  such  that  every  task  depends  on  all  the 
tasks  one  row  above  it.  Figure  3.b  shows  the  large 
number  of  additional  data  dependencies  of  the  pro¬ 
gram  caused  by  the  requirements  of  the  synchronous 
execution  model. 

We  know  that  adding/removing  links  to  a  task 
graph  decreases/increases  the  parallelism  of  the  task 
graph,  respectively.  The  Virtual-Time  Data-Parallel 
Machine  promotes  asynchronous  execution  by  remov¬ 
ing  those  extra  links  associated  with  synchronous  exe¬ 
cution  (i.e.,  to  achieve  better  performance)  while  pre- 


a)  Data  Dependency  of  the  Program 
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b)  Data  Dependency  of  Synchronous  Execution 


Figure  3:  The  task  graph  representation  of  data  de¬ 
pendencies. 


serving  the  original  data  dependencies  (i.e.,  without 
sacrificing  the  semantics). 

4  Performance  Modeling  -  Partial 
Synchronization 

A  simple  model  of  the  Virtual-Time  Data-Parallel  Ma¬ 
chine  is  the  “partial  synchronization”  model  [1],  in 
contrast  to  “total  synchronization”  model  (i.e.,  barrier 
synchronization).  The  model  is  as  follows.  The  SIMD 
machine  consists  of  N  homogeneous  processors,  i.e., 
every  processor  has  the  same  processing  power  and 
executes  the  same  instruction  stream  such  that  the 
behavior  of  every  processor  is  statistically  equivalent. 
The  task  graph  (Fig.  4)  is  an  oo  by  N  matrix,  one  col¬ 
umn  for  each  processor.  Each  task  depends  on  a  set 
of  tasks  one  row  above  it.  Let  the  size  of  this  set  be 
A,  which  is  the  number  of  immediate  ancestors  of  this 
task.  If  A  =  N  for  all  tasks,  then  it  is  total  synchro¬ 
nization;  otherwise,  it  is  partial  synchronization.  We 
are  interested  in  the  case  where  A  is  a  small  number3. 

3If  the  tasks  correspond  to  instructions,  then  the  value  of  A  is 
analogous  to  the  number  of  operands  of  an  assembly  instruction, 
which  is  usually  small. 
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Figure  4:  A  snapshot  of  a  task  graph  in  execution. 


For  partial  synchronization,  the  instructions  in  ex- 
ecution  spread  out  in  virtual-time4.  However,  this  dis¬ 
persion  is  confined  due  to  the  data  dependency  con¬ 
straints.  In  order  to  describe  the  dynamic  behavior 
of  the  system,  we  introduce  the  following  terms  (refer 
to  Fig.  4).  The  global  virtual-time  (GVT)  of  the  sys¬ 
tem  is  defined  to  be  the  minimum  virtual-time  of  all 
processors,  i.e., 

GVT  =  min{virtual-time}  (1) 

Vproc. 

The  relative  virtual-time  (rvt)  of  an  instruction  (or  a 
processor)  is  defined  to  be  the  difference  between  its 
virtual-time  and  the  GVT,  i.e., 

rvt  =  virtual-time  -  GVT  (2) 

The  dispersion  function  describes  how  processors  scat¬ 
ter  in  virtual-time,  i.e., 

dispersion(t)  =  Prob[rvt  =  i)  (3) 

which  can  be  interpreted  as  the  probability  that  the 
rvt  of  a  (tagged)  processor  is  i  (averaged  over  a  long 
period  of  real-time),  or  as  the  distribution  of  proces¬ 
sors  in  virtual-time  fat  some  real-time  instant).  Fig¬ 
ure  4  shows  a  snapshot  of  the  system,  and  Figure  5 
illustrates  its  dynamic  behavior. 

4In  this  paper,  “virtual- time”  and  “simulation- time”  are  used 
interchangeably,  as  are  “instruction”  and  “task”. 


We  make  the  following  assumptions  for  the  purpose 
of  performance  modeling: 

1.  The  execution  time  of  tasks  is  exponentially  dis¬ 
tributed. 

2.  The  number  of  immediate  ancestors  (A)  is  exactly 
two. 

3.  One  ancestor  is  the  preceding  task  on  the  same 
processor,  and  the  other  ancestor  is  uniformly  dis¬ 
tributed  among  the  tasks  in  the  preceding  row. 

Though  these  assumptions  are  not  particularly  real¬ 
istic,  they  are  simple  enough  to  capture  some  funda¬ 
mental  insight  as  to  how  asynchronous  execution  out¬ 
performs  synchronous  execution. 

We  are  interested  in  the  following  performance  mea¬ 
sures: 

Speed-Up: 


Speed-Up  = 

execution  time  with  a  single  processor 
execution  time  with  N  processors 


(4) 


Efficiency: 

Efficiency  =  ^HE  (5) 

Efficiency-Gain: 

Efficiency  Gain  = 

efficiency  of  asynchronous  execution  ^ 
efficiency  of  synchronous  execution 

Figure  6  shows  the  speed-up  and  efficiency  of  syn¬ 
chronous  execution  from  analysis  [2].  The  efficiency  of 
synchronous  execution  drops  (to  zero)  as  the  number 
of  processors  increases.  Figure  7  shows  the  speed-up 
and  efficiency  of  asynchronous  execution  as  obtained 
from  simulation5.  Analytical  results  [8]  show  that  the 
asymptotic  efficiency  for  a  large  number  of  processors 
is  approximately  40%.  Figure  8  shows  that  the  ef¬ 
ficiency  gain  is  proportional  to  the  logarithm  of  the 
number  of  processors.  From  this  figure  we  see  the  mo¬ 
tivation  for  considering  asynchronous  execution. 

We  now  address  the  scalability  of  the  Virtual-Time 
Data- Parallel  Machine.  The  traditional  definition  of 
scalability  is  with  respect  to  the  speed-up  of  running 
the  same  program  on  an  increasing  number  of  pro¬ 
cessors.  This  definition  is  not  directly  applicable  to 
data-parallel  machines  where  the  number  of  proces¬ 
sors  is  on  the  same  order  as  the  intrinsic  parallelism 
of  the  program.  When  the  number  of  processors  in¬ 
creases,  the  problem  size  must  increase  proportionally 
so  that  the  intrinsic  parallelism  increases  proportion¬ 
ally  as  well.  If  a  system  scales  up  well,  the  execution 
time  is  relatively  constant. 

5  Analyses  are  also  available  [8]  for  an  upper-bound,  a  lower 
bound,  and  an  approximation  to  the  efficiency  of  asynchronous 
execution. 
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Figure  5:  The  distribution  of  processors  in  virtual-time  at  several  real-time  instants.  GVT(t ),  global  virtual¬ 
time  as  a  function  of  real-time,  summarizes  the  progress  of  program  execution  at  real-time  t.  In  this  figure 
GVT(  10)  =  4,  GVT(20)  =  8,  and  GVT( 30)  =  12.  ’ 


Efficiency  Speed-Up  Efficiency  Speed-Up 


Figure  6:  Speed-up  and  efficiency  of  synchronous  ex¬ 
ecution. 


An  architecture  simulator  has  been  developed  to 
run  “real”  data-parallel  programs  on  the  Virtual-Time 
Data-Parallel  Machine  to  illustrate  its  superb  scalabil¬ 
ity.  The  main  assumption6  made  in  the  simulator  is 
the  randomness  in  instruction  execution  time  (in  fact, 
the  remote  access  time) .  Figure  9  shows  the  time  re¬ 
quired  to  solve  a  Laplace’s  equation  asynchronously 
versus  synchronously.  This  diagram  reveals  that  asyn- 


6  We  also  make  other  assumptions  on  the  number  of  cycles  for 
integer  and  floating-point  operations.  As  long  as  the  number  of 
cycles  for  computation  is  less  than  that  for  communication,  the 
simulation  results  are  not  sensitive  to  these  assumptions. 


Figure  7:  Speed-up  and  efficiency  of  asynchronous  ex¬ 
ecution. 


chronous  execution  scales  up  well  as  shown  by  the 
almost  constant  execution  time,  while  the  execution 
time  for  synchronous  execution  increases.  The  above 
argument  does  not  mean  that  asynchronous  execution 
is  more  favorable  than  synchronous  execution  in  solv¬ 
ing  partial  differential  equations  since,  had  the  instruc¬ 
tion  execution  time  been  constant,  as  if,  with  such 
equations,  then  the  synchronous  and  asynchronous  ap¬ 
proaches  would  have  behaved  similarly..  However,  it 
indicates  that  asynchronous  execution  is  capable  of 
smoothing  out  the  variations  of  instruction  execution 
time. 
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Figure  8:  The  efficiency  gain  -  asynchronous  over  syn¬ 
chronous. 

5  Hardware  Support  -  The  FIFO  Pri¬ 
ority  Cache 

During  asynchronous  execution,  processors  are  al¬ 
lowed  to  spread  out  at  different  virtual-times.  Con¬ 
sider  the  case  in  which  one  processor  at  a  smaller 
virtual- time  (say,  i)  sends  a  memory  request7  to  an¬ 
other  processor  at  a  larger  virtual-time  (j,  where 
j  >  i).  The  time  (virtual- time)  of  the  request  is  cur¬ 
rent  to  the  former  processor  but  previous  to  the  latter 
processor.  The  requested  value  has  actually  been  gen¬ 
erated  in  a  previous  instruction  (before  i)  on  the  latter 
processor,  and  is  subject  to  being  overwritten  by  in¬ 
structions  between  i  and  (j- 1).  In  order  to  prevent 
overwriting  useful  data,  every  processor  must  save  all 
previous  values  of  its  variables  (i.e.,  the  memory  “his¬ 
tory”)  back  to  GVT (since  no  processor  has  a  virtual- 
time  earlier  than  GVT,  no  values  earlier  than  GVTwill  be 
requested  in  the  future).  For  practical  reasons,  there 
is  a  physical  limit  on  the  size  of  the  memory  history, 
say  K .  If  a  processor  goes  so  fast  that  it  is  K  instruc¬ 
tions  ahead  of  GVT,  it  must  be  temporarily  suspended 
because  it  has  used  up  its  memory  history.  Analysis 
[8]  shows  that  if  K  is  greater  than  (In  N),  the  above 
situation  rarely  occurs  and  the  performance  hardly  de¬ 
grades  due  to  the  limited  size  of  memory  history.8 

Memory  history  is  so  important  and  so  frequently 
used  that  it  deserves  special  hardware  support.  We 
have  designed  a  “FIFO  priority  cache”  (which  imple¬ 
ments  the  incremental  backup  algorithm)  for  the  mem¬ 
ory  history  (one  cache  per  processor)  with  the  follow¬ 
ing  characteristics. 

FIFO  Queueing:  Write  requests  are  not  executed 
immediately;  instead,  they  are  pushed  into  a 

7 Memory  requests  are  time-stamped  to  unambiguously  specify 
the  requested  values. 

8 What  a  coincidence!  (In  N)  happens  to  be  the  performance  gain 
as  well. 


Figure  9:  The  execution  time  of  a  data-parallel  pro¬ 
gram  when  the  number  of  processors  increases  with 
the  problem  size. 


FIFO  queue  (of  size  K).  If  the  queue  is  full,  the 
oldest  pending  write  request  (i.e.,  the  one  with 
the  earliest  virtual-time)  is  popped  out  of  the 
FIFO  queue  and  then  sent  to  the  main  memory 
(i.e.,  to  update  the  main  memory). 

Associative  Search:  For  every  read  request  whose 
virtual-time  is  less  than  or  equal  to  the  program 
counter  of  the  processor,  we  conduct  an  associa¬ 
tive  search  (like  a  cache  memory)  in  the  queue 
for  hits,  where  a  hit  is  any  pending  write  request 
in  the  queue  with  matching  address  and  earlier 
virtual-time. 

Priority  Arbitration:  If  there  is  more  than  one  hit, 
we  choose  the  latest  hit,  i.e.,  the  one  with  the 
largest  virtual-time.  Priority  arbitration  (e.g., 
choosing  the  latest  hit)  can  be  implemented  by 
a  priority  encoder/ decoder  pair.  If  there  is  no 
hit  (i.e.,  a  cache  miss),  then  we  consult  the  main 
memory  because  the  requested  value  is  stored  in 
the  main  memory. 

Related  research  on  the  space-time  memory  can  be 
found  in  [3]  and  [8] . 

Synchronous  SIMD  machines  can  hardly  benefit 
from  the  cache  memory  because  a  cache  miss  for  one 
processor  is  aggravated  to  a  cache  miss  for  the  whole 
system.  Asynchronous  execution  allows  the  system  to 
take  full  advantage  of  the  cache  memory  technology  to 
resolve  the  speed  discrepancy  between  the  CPU  and 
the  main  memory.  A  small  FIFO  priority  cache  not 
only  supports  the  memory  history,  but  also  accelerates 
memory  references. 

6  Extensions  —  Two-Phase  Write 

Even  though  we  can  smooth  out  the  variations  of  the 

remote  access  time,  the  interconnection  network  may 
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still  be  the  performance  bottleneck  (however,  the  bot¬ 
tleneck  is  less  severe  than  before).  The  problem  resides 
in  the  mismatch  of  local  access  time  and  remote  ac¬ 
cess  time.  The  technology  is  such  that  the  speed  of 
the  processor  (and  the  memory)  has  improved  quickly 
but  the  speed  (in  terms  of  latency  instead  of  through¬ 
put)  of  the  interconnection  network  has  improved  at 
a  much  slower  rate.  As  a  result,  the  remote  access 
time  is  tens  to  one  hundred  times  larger  than  the  lo¬ 
cal  access  time,  and  the  mismatch  will  become  even 
worse  in  the  future.  The  processors  spend  most  of 
their  time  waiting  for  remote  operands  because  the  ac¬ 
tual  computation  or  the  local  references  can  be  done 
in  a  flash.  If  we  could  pipeline  the  instruction  exe¬ 
cution,  the  throughput  of  the  system  would  improve 
dramatically.  However,  the  variation  of  remote  access 
time  makes  pipelining  difficult  if  not  impossible. 

Traditional  pipelining  is  “synchronous”  pipelin¬ 
ing  in  the  sense  that  timing  information  is  lmown 
in  advance  so  that  a  reservation  table  synchronizes 
the  resource  allocation.  Data-flow  is  “asynchronous” 
pipelining  in  the  sense  that  only  the  data- dependency 
counts  and  timing  information  is  either  unknown  or 
irrelevant.  Since  this  paper  is  promoting  the  asyn- 
chronity ,  whenever  the  “synchronous”  approach  fails, 
try  its  “asynchronous”  counterpart. 

The  sequential  semantics  of  SIMD  programs  adds 
an  implicit  ancestor  to  every  instruction  on  every  pro¬ 
cessor  -  the  preceding  instruction  on  each  of  these  pro¬ 
cessors.  However,  we  observe  that  the  result  of  an  in¬ 
struction  may  not  be  used  immediately  by  the  next 
instruction.  If  the  current  instruction  is  blocked  (e.g., 
waiting  for  a  remote  operand),  the  execution  oi  the 
next  instruction  can  proceed  without  waiting  for  the 
current  instruction  to  finish.  Before  the  next  instruc¬ 
tion  starts  executing,  the  processor  must  schedule  the 
execution  of  the  current  instruction  and  invalidate  the 
variables  that  may  be  modified  by  the  current  instruc¬ 
tion.  Such  a  problem  was  addressed  many  years  ago 
by  the  Tomasulo  algorithm  [9],  This  algorithm,  used 
by  the  floating  point  unit  of  the  IBM  360/91,  con¬ 
verts  sequential  computation  into  data-flow  computa¬ 
tion  within  a  small  sliding  window  of  instructions. 

The  main  idea  is  to  separate  a  write  operation  of 
a  variable  into  two  phases  -  the  logical  write  and  the 
physical  write.  The  logical  write  is  executed  first  be¬ 
fore  the  content  of  the  write  operation  is  available;  it 
invalidates  the  variable  and  assigns  a  unique  identi¬ 
fier  to  the  content  of  the  write  operation.  From  then 
on,  all  read  requests  to  that  variable  (before  the  vari¬ 
able  is  overwritten)  are  transformed  to  waiting  for  that 
identifier.  The  next  instruction  can  proceed  after  the 
logical  write  without  waiting  for  the  physical  write. 
The  physical  write  is  executed  when  the  content  of 
the  variable  becomes  available;  it  is  sent  to  every  pro¬ 
cessor  waiting  for  the  corresponding  identifier.  Once 
we  adopt  the  two-phase  write,  then  head-of-the-line 
blocking,  which  enforces  sequential  execution,  is  elim¬ 
inated;  at  the  same  time,  the  sequential  semantics 
are  preserved.  Thus  we  see  that  the  techniques  used 
to  compensate  for  the  long  pipeline  stages  of  floating 
point  arithmetic  units  in  sequential  machine  may  now 
be  used  to  compensate  for  the  long  (network)  delays 


due  to  remote  access  in  parallel  processing  systems. 

The  two-phase  write  can  be  easily  implemented  in 
the  memory  history  by  adding  an  extra  busy  bit  to  ev¬ 
ery  outstanding  update.  A  logical  write  sets  the  busy 
bit  to  one,  representing  the  fact  that  an  update  is  tak¬ 
ing  place,  and  the  content  is  not  available.  A  physical 
write  resets  the  busy  bit  to  zero,  representing  the  fact 
that  the  data  in  this  update  is  available.  References 
to  a  busy  update  receive  the  time-stamped  address  of 
the  update  (which  serves  as  the  unique  identifier),  and 
then  get  blocked.  When  a  physical  write  is  executed, 
all  references  to  the  matching  identifier  are  unblocked. 

With  the  two-phase  write,  the  Virtual- Time  Data- 
Parallel  Machine  converts  the  SIMD  computation 
from  control-flow  to  data-flow  (within  a  small  sliding 
window  of  neighboring  instructions).  Data-flow  execu¬ 
tion  recovers  more  threads  of  execution  than  control- 
flow,  which  increases  the  concurrency  and  improves 
the  efficiency  of  the  Virtual-Time  Data-Parallel  Ma¬ 
chine. 

7  Conclusions 

Long  and  unpredictable  remote  access  latency  is  often 
the  performance  bottleneck  of  massively  parallel  com¬ 
puting.  Asynchronous  execution  of  the  Virtual-Time 
Data-Parallel  Machine  provides  one  way  to  relieve  this 
bottleneck.  We  have  proposed  some  minimal  modi¬ 
fications  to  the  architecture  of  the  traditional  data- 
parallel  machine  (e.g.,  the  CM-2),  which  converts  the 
way  it  executes  SIMD  programs  from  synchronous  to 
asynchronous.  We  have  provided  a  basic  foundation 
for  the  understanding  of  both  why  and  how  to  im¬ 
prove  the  efficiency  of  SIMD  programs  by  allowing 
asynchronous  execution. 

Asynchronous  execution  makes  the  machine  more 
MIMD  (Multiple  Instruction  Multiple  Data)-like.  It  is 
nevertheless  a  SIMD  machine  from  the  programmer’s 
point  of  view!  A  more  detailed  discussion  of  these 
ideas  and  a  comprehensive  quantitative  analysis  of  this 
machine  can  be  found  in  [8J. 
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Abstract — In  this  paper,  we  first  study  the  performance  of  job 
scheduling  in  a  large  parallel  processing  system  where  a  job  is 
modeled  as  a  concatenation  of  two  stages  which  must  be  processed 
in  sequence.  We  denote  Pi  as  the  number  of  processors  required 
by  stage  i  and  denote  P  as  the  total  number  of  processors 
in  the  system.  The  service  time  requirement  of  stage  i,  given 
the  required  Pi  processors,  is  exponentially  distributed  with 
mean  1  fni*  Hence,  a  job  can  be  fully  described  by  a  quadruple 
(Pi,  P2,  iii 9  /i-2 )-  Three  service  disciplines  which  can  fully  utilize 
all  processors  in  the  system  are  studied  in  this  paper. 

We  first  consider  a  large  parallel  computing  system  where 
Max(Px,P2)  >  P  »  1  and  Max(Pi,P2)  >>  Min(Pi,P2). 
For  such  systems,  exact  expressions  for  the  mean  system  delay 
are  obtained  for  various  job  models  and  disciplines.  Our  results 
show  that  the  priority  should  be  given  to  jobs  working  on  the 
stage  which  requires  fewer  processors.  We  then  relax  the  lafge 
parallel  system  (i.e.,  P  »  1)  condition  to  obtain  the  mean  system 
time  for  two  job  models  when  the  priority  is  given  to  the  second 
Stage.  Moreover,  a  Scale-up  Rule  is  introduced  to  obtain  the 
approximated  delay  performance  when  the  system  provides  more 
processors  than  the  maximum  number  of  processors  required  by 
both  stages  (i.e.,  P  >  Max(Pi,P2)).  Lastly,  an  approximation 
model  is  given  for  jobs  with  more  than  two  stages. 

Index  Terms — High-concurrency  stage,  low-concurrency  stage, 
parallel  processing  system,  processor  sharing,  Scale-up  rule. 


I.  Introduction 

IN  this  paper,  we  first  consider  a  parallel  computing  system 
in  which  jobs  are  composed  of  two  stages  which  must  be 
processed  in  sequence.  In  each  stage,  up  to  a  given  maximum 
number  of  processors  can  be  used  concurrently  and  the  number 
of  processors  required  by  different  stages  need  not  be  the  same. 
For  instance,  we  can  view  a  job  as  a  program  and  a  stage  in  a 
job  as  a  procedure  in  the  program  where  each  procedure  can 
be  executed  using  a  certain  amount  of  processors  concurrently. 
The  number  of  processors  required  by  a  procedure  depends  on 
how  many  processors  it  can  use  in  its  algorithm.  We  denote 
Pi  as  the  number  of  processors  required  by  stage  i  and  we 
denote  P  to  be  the  total  number  of  processors  in  the  system. 
The  service  time  of  stage  i  given  the  required  Pi  processors 
is  exponentially  distributed  with  mean  l/pi.  Hence,  a  job  can 
be  fully  described  by  a  quadruple  (Pi, P2? Mij M2)*  A  general 
concept  of  this  kind  of  job  model  was  first  proposed  in  [11]. 
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For  easier  notation,  we  denote  the  low-concurrency  stage 
as  the  stage  where  the  number  of  processors  required  equals 
Min(Pi,  P2).  Similarly,  we  denote  the  high-concurrency  stage 
as  the  stage  where  the  number  of  processors  required  equals 
Max(Pi,P2).  Hence,  the  job  models  considered  in  this  paper 
may  be  either  1)  the  low-concurrency  stage  precedes  the  high- 
concurrency  stage  or  2)  the  high-concurrency  stage  precedes 
the  low-concurrency  stage.  These  two  job  models  are  denoted 
as  the  LH  (Low-High)  job  model  and  HL  (High-Low)  job 
model ,  respectively. 

Other  than  considering  the  characteristics  of  jobs,  we  pro¬ 
pose  three  service  disciplines  for  such  systems.  The  basic 
principle  of  the  service  disciplines  studied  in  this  paper  is 
to  fully  utilize  all  processors  such  that  we  do  not  allow  the 
system  to  have  idle  processors  while  there  are  jobs  waiting  in 
the  queue.  To  achieve  this,  the  system  has  to  share  processors 
among  jobs  according  to  a  service  discipline.  Two  of  the 
disciplines  studied  in  this  paper  use  priority  schemes  which 
assign  the  priority  to  a  job  according  to  which  stage  the  job 
is  working  on.  The  third  discipline  follows  a  straight  First 
Come  First  Serve  rule.  Hence  the  disciplines  studied  are  1) 
priority  given  to  jobs  working  on  stage  1  with  preemption,  2) 
priority  given  to  jobs  working  on  stage  2  with  preemption,  and 
3)  FCFS  without  preemption.  More  details  about  job  models 
and  service  disciplines  are  given  in  the  following  section. 

For  the  2-stage  job  model,  we  first  find  the  mean  system  time 
of  LH  and  HL  job  models  under  different  kinds  of  disciplines. 
From  the  obtained  mean  system  time  analysis,  we  then  find 
the  best  service  discipline  to  minimize  the  mean  system  time 
for  both  LH  and  HL  job  models.  These  results  shed  light 
on  designing  the  operating  system  for  a  parallel  computing 
system. 

The  paper  is  organized  as  follows.  In  Section  II,  we  give 
a  detailed  description  to  the  job  models  and  various  service 
disciplines.  In  Section  III,  we  assume  the  number  of  processors 
required  by  the  low-concurrency  stage  is  far  fewer  than 
that  of  the  high-concurrency  stage  (i.e.,  Max(Pi,P2)  >> 
Min(Pi ,  P2))  and  we  further  assume  the  number  of  processors 
in  the  system  is  no  more  than  the  number  of  processors  re¬ 
quired  by  the  high-concurrency  stage  (i.e.,  P  <  Uax(PlP2)). 
The  performance  of  various  combinations  of  job  models  and 
service  disciplines  are  given  and  a  comparison  to  find  the  best 
service  discipline  is  also  provided.  In  Section  IV,  we  drop  the 
Max(Pi,P2)  >>  Min(Pi,P2)  condition  and  find  the  delay 
performance  using  the  discipline  which  gives  the  priority  to 
jobs  working  on  stage  2  with  preemption.  In  Section  V,  we 
propose  a  Scale-up  Rule  which  gives  a  good  approximation 
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result  of  the  mean  system  time  when  the  number  of  processors 
in  the  system  is  more  than  the  number  of  processors  required 
by  the  high-concurrency  stage  (i.e.,  P  >  Max(Pi,P2)). 
Section  VI  contains  an  approximation  model  for  jobs  with 
more  than  two  stages.  The  concluding  remarks  are  given  in 
Section  VII. 


IL  Model  Description  and  Assumptions 

In  our  model,  jobs  arrive  to  the  system  according  to  a 
Poisson  process  with  rate  A.  Whenever  a  job  in  a  stage  requires 
more  processors  than  the  system  currently  can  provide,  this 
job  simply  uses  all  the  processors  available  to  it  with  an 
appropriately  elongated  stage  service  time  such  that  the  work 
done  for  that  stage  remains  unchanged.  That  is,  the  service 
time  for  that  stage  will  still  be  exponentially  distributed  but 
with  a  larger  mean.  For  example,  if  a  job  in  a  stage  can  use 
10  processors  for  one  second  of  work  and  if  there  are  only  5 
processors  available,  it  will  use  these  5  processors  and  require 
two  seconds  of  work  to  complete  its  work.  This  elongated 
stage  service  time  with  5  processors  ensures  the  conservation 
of  the  work  required  in  that  stage.  As  mentioned  earlier,  this 
is  to  fully  utilize  all  the  processors  in  the  system.  An  example 
of  an  LH  job  model  is  shown  in  Fig.  1(a)  and  an  HL  job 
model  is  shown  in  Fig.  1(b)  where  Max(Pi,  P2)/Min(Pi,  P2) 
equals  m. 

The  first  two  service  disciplines  considered  in  this  paper 
are  a  version  of  priority  queueing  with  preemption.  If  the  total 
number  of  processors  occupied  by  higher  priority  jobs  in  the 
system  is  less  than  P ,  those  processors  which  are  not  needed 
by  these  higher  priority  jobs  are  assigned  to  lower  priority  jobs. 
The  assignment  of  priorities  to  a  job  is  based  upon  which  stage 
the  job  is  currently  working  on.  Hence,  when  a  job  finishes 
the  first  stage  and  advances  to  the  second  stage,  the  priority 
ranking  of  this  job  will  be  changed. 

For  instance,  if  the  priority  is  assigned  to  jobs  working  on 
stage  2,  then  a  job  advances  from  stage  1  to  stage  2  will  gain 
a  higher  priority.  Moreover,  in  cases  where  P2  >  Pi  and  jobs 
working  in  stage  2  have  higher  priorities,  a  job  advances  from 
stage  1  to  stage  2  will  achieve  a  higher  priority  and  will  need 
more  processors  to  work  on  stage  2.  In  this  case,  this  job  will 
preempt  processors  from  those  jobs  working  on  stage  1  until 
either  it  has  gained  enough  processors  for  stage  2  or  there 
are  no  more  jobs  working  on  stage  1.  Those  jobs  with  all 
processors  preempted  will  be  pushed  back  to  the  head  of  the 
queue  waiting  for  available  processors.  The  third  discipline 
considered  in  this  paper  does  not  allow  processor  preemption 
and  follows  a  First  Come  First  Serve  rule. 

Furthermore,  for  all  three  disciplines,  if  there  are  more  than 
one  job  wanting  to  share  processors  in  the  same  stage,  we  will 
first  satisfy  the  processor  requirement  of  the  first  job  before 
allocating  processors  to  the  second  job  and  so  on.  This  process 
continues  until  all  processors  are  allocated  to  jobs  or  until 
all  jobs  are  satisfied.  If  there  are  jobs  which  do  not  receive 
processors,  they  stay  in  the  queue  of  that  stage.  Hence,  the 
service  discipline  for  each  stage  is  a  FCFS  scheme  for  all 
disciplines.  The  details  of  the  service  disciplines  follows. 


Processor 


Processor 


(a)  (b) 

Fig.  1.  LH  and  HL  job  models,  (a)  LH  job  model,  (b)  HL  job  model. 


A.  Priority  Given  to  Jobs  Working  on  Stage  One  with  Preemption 

A  job  is  allowed  to  receive  service  as  long  as  there  are 
available  processors  in  the  system  following  a  preemptive 
priority  service  discipline.  A  job’s  priority  decreases  when  it 
finishes  stage  1  and  enters  stage  2.  Under  such  a  preemptive 
priority  scheme,  when  a  job  enters  the  system  while  there 
are  no  available  processors  in  the  system,  it  can  preempt 
processors  from  jobs  working  on  stage  2,  if  there  is  any,  to 
start  servicing  its  stage  1. 

B.  Priority  Given  to  Jobs  Working  on  Stage  Two  with  Preemption 

In  this  discipline,  a  job’s  priority  increases  when  it  finishes 
stage  1  and  enters  stage  2.  Hence,  when  a  job  enters  stage  2 
and  requires  more  processors  than  it  currently  possesses  from 
stage  1,  it  can  preempt  processors  from  jobs  working  on 
stage  1,  if  there  is  any,  to  start  servicing  its  stage  2. 

C.  FCFS  Discipline  without  Preemption 

The  discipline  introduced  here  is  a  FCFS  discipline  where 
preemption  is  not  allowed.  A  job  is  allowed  to  receive  service 
as  long  as  there  are  available  processors  in  the  system.  Further, 
the  processors  occupied  by  a  job  cannot  be  preempted  by  any 
other  jobs.  However,  if  there  are  processors  released  by  a  job, 
those  jobs  already  in  service  which  need  more  processors  have 
a  higher  priority  than  jobs  in  the  queue  to  occupy  the  released 
processors.  A  job  may  release  processors  by  either  advancing 
from  the  high-concurrency  stage  to  the  low-concurrency  stage 
or  by  finishing  stage  2  and  leaving  the  system. 

III.  Performance  Evaluation  of  Different 
Job  Models  Under  Various  Disciplines 

In  this  section,  we  consider  a  large  parallel  computing 
system  where  Max(Pi,P2)  >  P  »  1  and  Max(Pi,P2)  >> 
Min(Pi,P2).  Before  further  discussion,  we  will  first  examine 
the  cases  when  P  <  Max(P1?P2).  When  P  <  Max(Pi,P2), 
the  high-concurrency  stage  can  actually  use  only  P  processors, 
hence  the  quadruple  job  description  should  be  modified.  For 
instance,  if  stage  2  is  the  high-concurrency  stage  (i.e.,  P2  > 
P  >  Pi),  then  the  quadruple  (Pi,  P2,  p  1,  M2)  should  be 
modified  as  (Px,  P,  Mi ,P/- P2M2)*  That  is,  stage  2  can  use  all 
P  processors  for  an  elongated  service  time.  Hence,  we  will 
only  consider  the  case  when  P  =  Max(Pi,P2)  without  loss 
of  generality. 
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-  To  give  an  example  for  the  job  model  discussed  in  this 
section,  the  low-concurrency  stage  can  be  regarded  as  a 
serial  stage  which  requires  only  one  processor  and  the  high- 
concurrency  stage  as  a  parallel  stage  which  can  use  all 
processors  in  the  system.  For  such  a  system,  job  models 
shown  in  Fig.  1(a)  and  (b)  can  be  approximated  as  shown  in 
Fig.  2(a)  and  (b),  respectively  since  the  number  of  processors 
required  by  the  low-concurrency  stage  is  negligible  compared 
to  that  required  by  the  high-concurrency  stage.  For  the  rest  of 
this  section,  we  will  use  these  approximated  job  models  for 
analysis.  It  will  be  shown  in  Section  IV-A  and  Fig.  6  and  when 
Ma-x(Pi,P2)  >  20Min(Pi,P2)>  the  delay  performance  of  the 
job  models  shown  in  Fig.  1  is  very  close  to  the  job  models 
shown  in  Fig.  2,  which  will  be  analyzed  in  this  section. 

For  such  a  system,  the  low-concurrency  stage  requires  a  neg¬ 
ligible  amount  of  total  processors  while  the  high-concurrency 
stage  requires  all  P  processors  in  the  system.  Therefore,  if 
the  high-concurrency  stage  has  a  higher  priority,  then  there 
can  be  at  most  one  job  working  on  the  high-concurrency  stage 
since  it  will  occupy  all  processors  in  the  system.  Under  such 
a  circumstance,  all  other  jobs  in  the  system  will  be  forced  to 
wait  in  the  queue  since  there  are  no  available  processors  left. 

On  the  other  hand,  if  the  low-concurrency  stage  has  a  higher 
priority,  then  all  jobs  working  on  the  low-concurrency  stage 
can  work  concurrently  and  they  will  have  no  effect  on  jobs 
working  on  the  high-concurrency  stage  because  the  processors 
taken  by  jobs  in  the  low-concurrency  stage  are  negligible. 
That  is,  in  this  case,  there  can  be  as  many  jobs  working 
on  the  low-concurrency  stage  and  one  job  working  on  the 
high-concurrency  stage  at  the  same  time. 

In  this  section,  we  will  give  results  of  the  mean  system 
time  of  LH  and  HL  job  models  under  various  disciplines. 
Clearly,  there  will  be  six  models  from  a  combination  of  two 
job  models  and  three  scheduling  disciplines.  These  six  models 
are  classified  into  three  groups  in  according  to  the  service 
disciplines.  Note  that  the  formula  of  infinite  server  queues  are 
approximations  to  the  actual  system. 

A.  Priority  Given  to  Jobs  Working  on  Stage  One  with  Preemption 

In  this  section,  we  give  the  priority  to  jobs  working  on 
stage  1.  For  the  two  job  models,  we  denote  Tlh  and  Thl  as 
the  mean  system  time  for  the  LH  job  model  and  the  HL  job 
model,  respectively.  These  notation  will  be  used  throughout 
this  section. 

Theorem  1:  For  systems  with  priority  given  to  jobs  working 
on  stage  1  with  preemption,  the  Laplace  Transform,  F^L(s), 
of  the  system  time  of  the  HL  job  model  is  shown  at  the  bottom 
of  this  page  where  p\  equals  X/pi  and  G*(s)  is  the  root  of 
the  following  quadratic  equation: 

G*(s)2  -  *  +  A. +  -  G*(s)  +  ^  =  0. 

A  A 


Processor 


Processor 


(a)  (b) 

Fig.  2.  The  approximated  LH  and  HL  job  models  when  P  =  Max(Pi, 
P2)  >>  Min(Pi,P2).  (a)  LH  job  model,  (b)  HLjob  model. 


Proof:  See  Appendix  A. 

Theorem  2:  For  systems  with  priority  given  to  jobs  working 
on  stage  1  with  preemption, 


Thl 


i/mi  +  V  M  2 _ x 

1  -  Pi  Mi(  1  -  Pi)2 


Tlh  =  —  + 
Pi 


y_P2 

1  -  X/fi2  ' 


(2) 

(3) 


Proof:  Equation  (2)  can  easily  be  derived  from  (1).  To  find 
Tlh 7  since  stage  1  has  higher  priority,  stage  1  can  be  regarded 
as  an  M/M/oo  system.  Further,  since  jobs  in  stage  1  occupy 
no  processors  at  all,  stage  2  can  be  regarded  as  an  M/M/1 
queue.  Hence,  TLh  can  be  shown  as  in  (3).  Q.E.D. 

Note  that  A/pi(=  pi)  is  the  system  load  for  the  HL  job 
model  since  only  stage  1  will  contribute  workload  to  the 
system.  Similarly,  A/ p2(=  P2)  is  the  system  load  for  the  LH 
job  model.  These  can  be  seen  from  (2)  and  (3). 


B.  Priority  Given  to  Jobs  Working  on  Stage  Two  with  Preemption 

For  systems  which  give  priorities  to  jobs  working  on  stage  2 
with  preemption,  we  first  derive  the  2: -transform  of  the  number 
of  jobs  in  the  system  as  given  in  the  following  equation. 


P{z) 


P2  -  X 
M2  ~  X Z 


expf  A/p 


1  • 


z  -  1  +  In 


M2  -  A 
|  A  z  -  M2 1 


where  P(z)  is  defined  as  the  ^-transform  of  the  number  of 
jobs  in  the  system.  The  proof  of  (4)  is  provided  in  Appendix  B. 
From  (4),  we  obtain  the  following  theorem. 

Theorem  3:  For  systems  with  priority  given  to  stage  2  with 
preemption, 


rr  i/Ml  +  1/M2 

Tlh=  1-V«' 


(5) 


Thl 


y_pi  l  1 

1  -  x/m  P2 4 


(6) 


nd*) 


_ MlM2(l-Pl)2  _ 

G*(s)[s  +  pi(l  —  Pi)]|>>  +  X  —  X  G*(s)  +  H2]  —  XG*(s)  +  ni\ 


(1) 
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Proof:  From  (4),  we  can  easily  derive  the  mean  number 
of  customers  in  the  system  for  the  LH  job  model  and  we  can 
hence  prove  (5)  using  Little’s  result  [12].  Again,  to  find  Thl  is 
easy  since  stage  1  performs  like  an  M/M/1  queue  and  stage  2 
is  like  an  M/M/oo  queue.  Q.E.D. 


C.  First  Come  First  Serve  Discipline 

The  performance  of  First  Come  First  Serve  discipline  for 
the  HL  job  model  is  the  same  as  when  the  priority  is  given  to 
stage  2  with  preemption  and  the  reason  follows.  Since  jobs  in 
stage  1  possess  more  processors  than  they  will  need  in  stage  2, 
hence  whenever  a  job  finishes  stage  1  and  enters  stage  2,  it 
does  not  need  to  ask  for  more  processors  than  it  already 
possesses.  Actually  it  will  release  processors  to  other  jobs. 
Hence,  all  jobs  in  stage  2  will  always  have  enough  processors. 
Therefore,  the  FCFS  discipline  for  the  HL  job  model  is  the 
same  as  the  case  when  the  priority  is  given  to  stage  2. 

However,  for  the  LH  job  model,  the  exact  result  is  not 
obtained.  Nonetheless,  we  can  easily  show  that  the  perfor¬ 
mance  is  worse  than  the  case  when  priority  is  given  to 
stage  1  with  preemption  by  the  following  reason.  For  FCFS 
discipline,  whenever  there  is  a  job  in  stage  2,  new  arriving 
jobs  cannot  start  receiving  service  for  the  low-concurrency 
stage  since  all  processors  are  occupied  by  old  jobs  in  stage  2. 
Hence,  these  new  jobs  would  have  to  wait  in  the  queue  before 
receiving  service  for  the  low-concurrency  stage.  However,  if 
the  discipline  used  is  to  give  priority  to  stage  1,  then  these  new 
arriving  jobs  can  immediately  start  receiving  service  for  the 
low-concurrency  stage  and  will  not  have  to  wait  in  the  queue. 
Moreover,  jobs  in  stage  1  do  not  really  occupy  processors; 
hence,  it  does  not  interfere  with  jobs  working  on  stage  2. 
Therefore,  the  mean  system  time  for  the  case  when  priority  is 
given  to  stage  1  with  preemption  for  the  LH  job  model  is  better 
than  that  using  the  FCFS  nonpreemptive  service  discipline. 
Applying  the  similar  argument,  we  can  show  that  the  LH  job 
model  performs  better  than  the  case  when  priority  is  given  to 
stage  2  with  preemption.  From  these  conclusions,  we  arrive  at 
the  following  results  for  the  FCFS  system. 


Thl 


1/Mi  +  J_ 
i  -  x/m  H2 


1/Ml  +  1/M2  t  1  1/M2 

1  -  X/H2  LH  Ml  1  -  Va«2  ’ 


(7) 

(8) 


D.  Performance  Comparison  Among  Various  Disciplines 

Comparing  (3),  (5),  and  (8),  we  find  that  for  LH  job  models, 
the  system  which  gives  priority  to  stage  1  with  preemption  has 
the  smallest  mean  system  time.  On  the  other  hand,  for  HL  job 
models,  we  find  that  the  system  which  gives  priority  to  stage  2 
with  preemption  and  FCFS  without  preemption  achieve  the 
smallest  average  system  time  by  comparing  (2),  (6),  and  (7). 
From  the  results  derived  above,  we  arrive  at  the  following 
conclusions: 

1)  FCFS  nonpreemptive  discipline  is  not  the  optimal  dis¬ 
cipline. 

2)  The  best  service  discipline  depends  on  job  models. 


3)  For  the  disciplines  considered  in  this  paper,  the  priority- 
should  be  given  to  the  low-concurrency  stage  which 
requires  fewer  processors. 

Conclusion  3)  above  may  not  seem  obvious.  As  a  known 
fact,  the  best  service  discipline  to  minimize  the  mean  system 
time  is  to  give  higher  priority  to  jobs  with  less  remaining 
service  time  if  that  is  known.  In  our  2-stage  job  model,  jobs 
working  on  stage  2  clearly  are  closer  to  completion  than  jobs 
working  on  stage  1;  hence,  the  priority  should  be  given  to 
jobs  working  on  stage  2  for  both  job  models.  However,  from 
conclusion  3)  stated  above,  we  know  that  the  priority  should 
be  given  to  stage  1  in  the  LH  job  model.  This  is  opposite  to 
our  intuitive  reasoning. 

The  reasons  for  this  can  be  explained  as  follows.  The  prob¬ 
lem  of  giving  the  priority  to  the  high-concurrency  stage  is  that 
by  so  doing,  whenever  there  is  a  job  in  the  high-concurrency 
stage,  all  jobs  in  the  low-concurrency  stage  will  be  forced 
to  wait  since  all  processors  are  occupied  by  the  job  in  the 
high-concurrency  stage.  Once  all  jobs  in  the  high-concurrency 
stage  finish  services,  all  the  processors  will  be  released  to 
jobs  waiting  to  receive  service  for  the  low-concurrency  stage. 
Since  the  number  of  processors  required  by  all  jobs  working 
in  the  low-concurrency  stage  is  small  compared  to  the  number 
of  processors  in  the  system,  therefore,  most  of  the  processors 
will  be  idle.  It  is  this  waiting  time  in  the  low-concurrency 
stage  and  the  inefficiency  of  utilizing  processors  which  causes 
the  poor  performance. 

On  the  other  hand,  if  the  priority  is  given  to  the  low- 
concurrency  stage,  then  all  jobs  in  the  low-concurrency  stage 
would  not  have  to  wait  in  any  case.  Further,  jobs  in  the 
high-concurrency  stage  can  also  receive  service  since  the 
number  of  processors  occupied  by  jobs  in  the  low-concurrency 
stage  is  negligible.  Hence,  by  giving  the  priority  to  the 
low-concurrency  stage  obtains  a  better  delay  performance 
for  all  job  models.  An  example  is  shown  in  Fig.  3  for 
(Pi,P2,Mi,  M2)  =  (1,20, 1, 1)  and  P  —  20.  Fig.  3  shows  that 
the  system  with  priority  given  to  jobs  working  on  stage  1  has 
the  best  performance  while  the  performance  of  the  system  with 
FCFS  discipline  is  very  close  to  it.  In  Fig.  3,  the  mean  system 
time  for  the  FCFS  discipline  is  obtained  by  simulations. 

This  conclusion  can  be  further  extended  to  a  3-stage  job 
model  where  a  job  is  composed  of  a  low-concurrency  stage 
followed  by  a  high-concurrency  stage  followed  by  another 
low-concurrency  stage  as  shown  in  Fig.  4.  From  the  results 
obtained  in  Theorems  2  and  3  and  by  defining  p2  to  be  X/p2, 
we  can  easily  obtain  the  mean  system  delay  for  the  following 
cases. 

Case  1:  If  the  highest  priority  is  given  to  jobs  working  on 
stage  3,  the  next  priority  to  jobs  working  on  stage  2,  and  the 
lowest  priority  to  jobs  working  on  stage  1;  we  have 

Tihl  =  ^  +  1/<12  +  l/«. 

1  -  P2 

Case  2:  If  the  highest  priority  is  given  to  jobs  working  on 
stage  3,  the  next  priority  to  jobs  working  on  stage  1,  and  the 
lowest  priority  to  jobs  working  on  stage  2;  we  have 

TLHl  —  1/ Mi  +  v  ^2-  +  1/^3* 

1  “  p2 
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Fig.  3.  The  comparison  for  three  service  disciplines  for  (Pi ,  P2,  Mi , 
M2)  -  (1,20,1,1)  and  P  =  20. 


Fig.  4.  A  3 -stage  job  model  with  one  high-concurrency  stage  and  two 
low-concurrency  stages. 


Case  3:  If  the  highest  priority  is  given  to  jobs  working  on 
stage  1,  the  next  priority  to  jobs  working  on  stage  2,  and  the 
lowest  priority  to  jobs  working  on  stage  3;  we  have 


Tlhl  =  l/^i  + 


1/^2  +  1/^3 
1  ”  P2 


pK1  —  Pa) 


will  only  consider  the  discipline  which  gives  the  priority  to 
stage  2  with  preemption. 

A.  The  LH  Job  Model 

In  this  subsection,  we  consider  the  situation  when  the  low- 
concurrency  stage  precedes  the  high-concurrency  stage.  This 
job  can  be  modeled  as  shown  in  Fig.  1(a).  The  difference 
between  this  model  and  the  one  given  in  the  previous  section 
is  that  in  this  model  there  can  be  at  most  \m]  jobs  working 
concurrently  on  the  low-concurrency  stage  if  there  is  no  job 
working  on  the  high-concurrency  stage.  As  before,  if  there  is 
a  job  working  on  the  high-concurrency  stage,  all  jobs  in  the 
low-concurrency  stage  will  be  forced  to  wait  in  the  queue. 

To  draw  a  Markov  chain  for  such  a  model,  since  P2  =  mPi, 
we  can  normalize  the  number  of  processors  required  by  stages 
by  using  Pi  =  1  and  P2  =  P  =  m  without  affecting  the 
result.  For  example,  a  job  model  with  Pi  =  2,  P2  =  6,  and 
P  =  6  should  have  the  same  performance  with  the  job  model 
with  Pi  =  1,  P2  =  3,  and  P  —  3.  We  define  p(k,j )  to  be 
the  probability  that  there  are  k  jobs  in  stage  1  and  j  jobs  in 
stage  2  in  the  system.  Since  the  priority  is  given  to  stage  2, 
hence  whenever  j  >  0,  then  all  k  jobs  in  stage  1  have  to  wait 
in  the  queue.  Furthermore,  all  jobs  in  stage  1  will  be  paused 
whenever  j  =  1;  hence  no  more  job  can  advance  from  stage  1 
into  stage  2.  Hence,  the  value  of  j  can  only  be  0  or  1.  If  j  =  0 
and  k  <  m,  then  all  jobs  in  stage  1  are  in  service.  If  j  —  0 
and  k  >  m,  then  there  are  m  jobs  working  on  stage  1  and  the 
other  k-m  jobs  are  waiting  in  the  queue.  Hence,  the  Markov 
chain  of  this  model  can  be  shown  in  Fig.  5. 

Note  that  the  overall  system  load  can  be  expressed  as 
X(mpi  +  /i2)/rapiP2,  hence  the  condition  for  a  stable  system 
is  A  <  +  /z2).  We  obtain  the  ^-transform  of  the 

number  of  jobs  in  the  system  in  the  following  equations. 

For  the  LH  job  model  as  shown  in  Fig.  1(a),  the  ^-transform 
of  the  number  of  jobs  in  the  system  is 
_  M1P2 


P{z)  = 


—X2z  +  m/ii^2  —  mXfiiz  —  X(i2Z  +  X  2z2 

m  —  1 

53  (m  -  °Mfe 


Note  that  in  Case  1,  we  give  the  priority  to  jobs  closer 
to  completion  while  in  Case  2  the  priority  is  given  to  jobs 
working  on  stages  which  require  fewer  processors.  From  these 
results,  we  can  clearly  see  that  Case  2  has  the  smallest  mean 
system  time.  Hence,  our  conclusion  in  this  section  also  works 
for  3-stage  job  models  with  one  high-concurrency  stage. 

IV.  Delay  Analysis  by  Relaxing 
Max(Pi,P2)  >>  Min(P1)P2)  Condition 

In  this  section,  we  relax  the  Max(Pl5  P2)  >>  Min(Pi,  P2) 
condition  as  required  in  the  previous  section.  As  before,  we 
define  m  to  be  Max(Pi,P2)/Min(Pi,P2)  and  denote  P  as 
the  total  number  of  processors  and  assume  P  =  Max(Pi,  P2), 
i.e.,  the  number  of  processors  in  the  system  equals  the  number 
of  processors  required  by  the  high-concurrency  stage.  For 
cases  when  P  >  Max(Pi,P2),  a  good  approximation  will 
be  given  after  the  introduction  of  the  Scale-up  rule,  which 
will  be  described  in  the  following  section.  In  this  section,  we 


where  p(k,  0)  for  0  <  k  <  m  —  1  can  be  obtained  from  (10) 
to  (13).  ' 

y.(m-  k)p(k,  0)  =  . (10) 

fro  ^ 

p(i)0)  =  a(a  +  M2)k  0,0)  (ii) 

^2.°)=A,|(A+^+VllP( A*) 

For  k  >  3 

j *M)  =  +  ^  +  _  li0) 

kpiH2 


\[\  +  2(k-i)m] 


kp  1M2 


P(k-  2,0) 


kp  1H2 


p(k-  3,0). 
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Fig.  5.  Markov  chain  for  LH  job  model. 


The  proofs  of  (9)  to  (13)  are  included  in  the  Appendix  C. 
From  the  above  results,  we  can  easily  find  the  mean  number 
of  jobs  in  the  system.  Using  the  mean  number  of  jobs  in  the 
system  and  Little’s  result  [12],  we  can  find  the  mean  system 
time  as  follows. 


TlH  =  T7 - k)P(k’0) 

\{mnui2  -  mXm  -  Xfi2)  “ 

_  A  -  m/xi  -  ^2 
mfi  1/J2  ~  mXfii  —  Xfi2 

where  p(fc,  0)  for  0  <  k  <  m  —  1  can  be  obtained  from  (10) 
to  (13). 

Here  we  will  compare  the  results  shown  in  (5)  and  (14). 
For  both  cases,  the  job  model  considered  are  the  LH  job 
model.  As  mentioned  in  Section  III,  the  results  obtained  from 
(5)  should  be  a  good  approximation  to  that  from  (14)  when 
m  »  1.  Fig.  6  gives  the  delay  performance  of  an  example 
with  (Pi, P2, /xi, M2)  —  (1, 20, 1,1)  using  (14)  and  another 
curve  derived  from  (5)  using  /ii  =  M2  =  1*  Fig-  6  shows 
that  these  two  curves  are  very  close  to  each  other.  Intuitively, 
we  know  that  the  results  derived  from  (5)  and  (14)  will  be 
even  closer  for  a  larger  m.  This  confirms  our  assumption  in 
Section  III. 

B .  The  HL  Job  Model 

We  study  the  case  when  the  high-concurrency  stage  precedes 
the  low-concurrency  stage  in  this  subsection.  As  defined 
above,  we  define  p(k,  j)  to  be  the  probability  that  there  are  k 
jobs  in  stage  1  and  j  jobs  in  stage  2  in  the  system.  Since  the 
priority  is  given  to  stage  2,  hence  whenever  there  are  j  jobs 
working  in  stage  2  (i.e.,  j  =  m  =  P ),  then  all  k  jobs  in  stage  1 
are  forced  to  wait  in  the  queue.  Therefore,  there  can  be  at  most 
m  jobs  in  stage  2.  Unfortunately,  we  are  not  able  to  solve  the 
general  case  for  an  arbitrary  value  of  m.  However,  the  result 
for  m  —  2  can  be  derived  and  are  given  at  the  bottom  of  this 
page.  The  Markov  chain  for  m  =  2  is  shown  in  Fig.  7. 

To  prove  (15)  is  easy  but  tedious.  We  omit  the  details  which 
can  be  found  in  [7]. 
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Fig.  6.  Comparing  (5)  and  (14)  for  m  =  20. 


V.  Scale-up  Rule 

The  results  derived  in  previous  sections  are  for  cases  when 
the  number  of  processors  in  the  system  equals  the  maximum 
number  of  processors  required  by  the  job.  This  assumption 
simplified  the  Markov  chain  dramatically,  hence  making  the 
analysis  feasible.  However,  this  analysis  is  infeasible  when 
the  number  of  available  processors  is  greater  than  the  max¬ 
imum  number  of  processors  required  by  the  job  (i.e.,  P  > 
Max(Pi,P2)).  In  order  to  solve  this  problem,  we  propose  the 
Scale-up  Rule  which  gives  a  very  good  approximation  result 
without  adding  any  analytical  complexity.  An  application 
of  the  Scale-up  rule  to  the  2-stage  job  model  when  P  > 
Max(Pi,P2)  is  provided  in  this  section.  For  the  rest  of  the 
paper,  all  simulation  results  are  represented  by  90%  confidence 
interval  using  ^-distribution  and  the  approach  used  here  is  the 
“replicate/deletion”  approach  as  indicated  in  [20,  p.  551]. 

There  has  been  considerable  effort  paid  to  the  analysis 
of  the  mean  waiting  time  of  an  M/G/n  queue.  Some  of 
them  provided  bounds  [2],  [8]  while  some  of  them  provided 
approximations  [1],  [4],  [5],  [6],  [15],  [16].  In  this  paper,  we 
focus  our  attention  of  the  results  obtained  in  [13]  and  [14] 


_  (3/x g  +  l1 1M2  +  —  ^2(^1  ~  M1M2  ~  4/z2)^  ~  +  3/ij/i2  +  2/i|) 

HL  ^{X  +  2/i2)(Mi  +  2m2)(^Mi  +  2A  /i2  —  2/xi/i2) 
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which  provided  an  approximation  for  such  a  system.  In  [1] 
and  [3]  this  model  was  extended  to  achieve  a  better  result. 
We  define  Wm/g/u  as  the  mean  waiting  time  for  an  M/G/n 
system.  In  [13]  and  [14]  the  following  approximation  for 
Wm/g /n  was  suggested  as  shown  in  (16).  This  expression 
proved  to  be  a  very  good  approximation  as  some  simulation 
results  in  [7]  demonstrated. 

~  Wm/g/i  w 

WM/G/n  =  TTr -  *  WM/M/n.  (16) 

7  '  Wm/M/1 

A  queueing  model  with  a  varying  number  of  required 
processors  has  been  studied  in  some  previous  works  [17]-[19]. 
However,  in  these  works,  none  of  them  tried  to  make  a  full 
use  of  all  processors.  That  is,  only  our  model  will  not  allow 
cases  where  there  are  available  processors  idle  in  the  system 
while  there  are  also  jobs  waiting  in  the  queue. 

In  this  subsection,  we  will  extend  (16)  to  the  2-stage  job 
model  when  P  >  Max(Pi,P2).  Although  the  number  of 
processors  required  by  a  job  varies  during  its  execution  on 
different  stages,  we  were  surprised,  and  pleased,  to  discover 
that  the  rule  which  applies  to  the  classical  queueing  system  as 
stated  in  (16)  also  applies  in  our  model.  In  other  words,  if  we 
have  to  find  the  mean  waiting  time  for  a  parallel  system  with 
P  >  Max(Pi,P2),  we  will  first  find  the  mean  waiting  time 
for  the  system  with  P'  processors  where  Pf  =  Ma x(PlvP2) 
using  the  method  derived  in  the  previous  sections.  From  the 
result  obtained  for  P'  —  Max(Pi,  P2),  we  apply  the  following 
Scale-up  rule  to  obtain  the  result  for  system  with  P  processors 
(P  >  P'). 

Scale-up  Rule:  Given  a  system  with  P  processors  and  the 
job  model  (Pi, P2, Mi, /x2)T  we  want  to  find  the  mean  waiting 
time  if  P  >  Max(Pi,P2).  We  first  obtain  the  mean  waiting 
time  of  the  system  assuming  the  number  of  processors  in 
the  system,  denoted  as  P',  equals  Max(Pi,P2).  This  result 
can  be  obtained  from  the  previous  section.  We  denote  the 
mean  waiting  time  for  such  a  system  with  Pf  processors  as 
Wm/jm/i(p)>  where  p  is  the  system  load  and  JM  stands  for 
“Job  Model.”  For  the  original  system  with  P  processors  and 
denoting  P/P'  as  n  (n  >  1),  we  define  Wm/jm/u(p)  t0  be 
the  mean  waiting  time  of  the  system  with  P  =  nP'  processors. 


Similarly,  we  define  WM/M/n(p) t0  be  the  mean  waiting  time 
of  an  ordinary  M/M /n  queueing  system  with  mean  service 
time  equals  1/pi  +  1///2-  The  Scale-up  Rule  says  that 


WM/JM/n(p) 


~  Wm/jm/i(p) 
Wm/m/i(p) 


•  WM/M/n(p)- 


(17) 


Since  the  parameters  in  the  right  hand  side  of  (17)  are  all 
known  values,  hence  WM/jM/n(p)  in  the  left  hand  side  can 
be  calculated.  The  approximation  result  is  a  combination  of 
the  use  of  the  exact  result  from  Section  IV  and  the  use  of 
the  Scale-up  rule.  From  the  obtained  mean  waiting  time,  we 
can  easily  obtain  the  mean  system  time  by  adding  into  the 
mean  service  time  which  equals  1/ pi  +  l/ p>2-  Some  examples 
are  given  to  show  how  good  these  approximation  results  are. 
Figs.  8  to  11  give  a  comparison  between  simulation  results 
and  approximation  results  using  the  Scale-up  rule  where  the 
priority  is  given  to  stage  2.  The  approximation  results  are 
obtained  by  first  exactly  finding  the  mean  system  time  for 
P  =  2  using  the  technique  given  in  Section  IV,  then  the 
scale-up  rule  is  used  to  obtain  the  results  for  P  >  2.  Two 
cases  are  given  for  two  different  job  models.  One  is  described 
as  (Pi,  P2,  fix ,  /i-2)  =  (1,2, 1,1)  and  the  other  is  described  as 
(Pi,.P2,/ii,/i2)  =  (2, 1,1,1).  Figs.  8  and  9  show  the  results 
when  P  =  4  and  Figs.  10  and  11  show  the  result  when 
P  =  10.  From  these  figures  we  show  that  the  Scale-up  Rule 
is  indeed  a  very  good  approximation  method. 


VI.  An  Approximation  for  the  General  Cases 

As  we  mentioned  earlier,  to  exactly  evaluate  the  perfor¬ 
mance  of  a  general  case  with  N  (AT  >  2)  stages  is  extremely 
difficult.  In  this  section,  using  the  exact  solution  of  the 
2-stage  model  and  the  scale-up  rule,  we  give  an  approximation 
method  for  any  general  processor-time  task  graph  and  any 
number  of  processors  in  the  system.  In  this  model,  we  give 
higher  priority  to  jobs  closer  to  completion.  That  is,  jobs  in  the 
last  stage  (the  Nth  stage)  have  the  highest  priority  while  jobs 
in  the  first  stage  have  the  lowest  priority.  Again,  preemption  is 
assumed.  The  simulation  of  this  approximation  method  shows 
reasonably  good  results. 
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Fig.  8.  An  approximation  result  for  job  model  (1, 2, 1, 1)  and  P  =  4. 


Given  a  processor-time  task  graph,  we  divide  the  processor¬ 
time  task  graph  into  two  pieces  such  that  the  mean  service 
time  for  each  piece  is  the  same.  In  each  piece,  we  average  the 
work  load  over  its  service  time  to  find  the  average  number  of 
processors  needed.  By  so  doing,  we  achieve  a  2-stage  model 
as  analyzed  in  Section  IV.  We  use  this  modified  2-stage  model 
as  the  basis  for  the  approximation  model. 

If  the  first  stage  requires  fewer  processors  than  the  second 
stage  in  the  modified  model,  we  use  the  result  obtained  in 
Section  IV-A.  An  example  is  shown  in  Fig.  12.  In  Fig.  12(a), 
the  processor  vector  for  this  4-stage  job  is  [3, 1,  3,9]  and  the 
corresponding  time  vector  is  [1/2, 1/2, 1/2, 1/2].  By  dividing 
the  time  vector  into  two  equal  pieces,  the  first  two  stages  of 
Fig.  12(a)  will  be  merged  into  the  first  stage  of  the  modified 
model  as  shown  in  Fig.  12(b).  Similarly,  the  last  two  stages  of 


Fig.  10.  An  approximation  result  for  job  model  (1, 2, 1, 1)  and  P  =  10. 


Fig.  12(a)  will  be  merged  into  the  second  stage  of  the  modified 
model  as  shown  in  Fig.  12(b).  In  order  to  make  the  workload  in 
Fig.  12(b)  to  be  the  same  as  in  Fig.  12(a),  the  processor  vector 
of  Fig.  12(b)  is  [2,6]  and  the  time  vector  is  [1,1].  Figs.  13 
and  14  show  the  approximation  results  for  this  example  given 
12  and  45  processors  in  the  system,  respectively.  From  these 
figures,  we  see  that  the  approximation  results  are  very  close 
to  the  simulation  results. 

If  the  first  stage  requires  more  processors  than  the  second 
stage  in  the  modified  model,  there  is  one  more  step  to  be 
done  in  the  approximation  method.  An  example  is  given  in 
Fig.  15.  In  Fig.  15(a),  the  processor  vector  is  [3, 9, 3, 1]  and  the 
time  vector  is  [1/2, 1/2, 1/2, 1/2].  Applying  the  same  rule  as 
we  did  in  Fig.  12,  we  convert  Fig.  15(a)  into  Fig.  15(b)  such 
that  the  processor  vector  and  the  time  vector  of  Fig.  15(b) 
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Processor 


Processor 


6 

3 


Time 


(b) 


Fig.  12.  (a)  An  example  with  [Pi ,  P2,  P3,  -P4,  Mi  » M4]  =  [3,1, 
3,9, 1/2, 1/2, 1/2, 1/2].  (b)  An  approximation  model  for  (a). 


are  [6,2]  and  [1,1],  respectively.  Since  we  can  analyze  the 
system  only  when  the  number  of  processors  required  in  the 
first  stage  is  exactly  twice  of  that  in  the  second  stage,  we  have 
to  modify  the  modified  2-stage  model  by  using  a  modified 
3-stage  model.  The  first  stage  of  the  modified  3-stage  model 
[as  shown  in  Fig.  15(c)]  is  the  same  as  the  first  stage  of  the 
modified  2-stage  model  [as  shown  in  Fig.  15(b)].  The  second 
stage  of  the  modified  3-stage  model  is  modified  such  that  it 
requires  exactly  half  of  the  processors  required  in  the  first 
stage  (of  the  modified  3-stage  model)  and  the  total  workload 
required  in  the  stage  is  the  same  as  that  of  the  second  stage 


from  the  2-stage  model.  The  third  stage  in  the  3-stage  model  is 
used  to  adjust  the  no-queueing  service  time  such  that  it  is  the 
same  for  the  modified  2-stage  model  and  the  modified  3-stage 
model. 

The  processor  vector  and  the  time  vector  for  Fig.  15(c)  are 
[6,3,0]  and  [1,2/3, 1/3],  respectively.  Although  the  model  in 
Fig.  15(c)  is  different  from  the  model  described  in  Section  IV- 
B,  it  can  be  solved  by  using  the  result  from  Section  IV-B. 
Notice  that  the  third  stage  has  the  highest  priority  and  requires 
no  processors  from  the  system  at  all;  the  existence  of  stage 
three  has  no  impact  on  stages  one  and  two.  Hence,  we 
can  solve  this  problem  by  first  neglecting  the  third  stage  in 
Fig.  15(c)  and  apply  the  results  obtained  from  Section  IV-B; 
from  this  result,  we  add  the  mean  service  time  of  stage  three 
to  it  to  get  the  overall  mean  response  time.  Figs.  16  and  17 
show  the  approximation  results  for  this  example  given  there 
are  12  and  45  processors  in  the  system,  respectively. 


VII.  Conclusions 

In  this  paper  we  are  able  to  find  the  average  system  delay 
of  various  job  models  and  disciplines  when  Max(Pi,  P2)  > 
P  »  1  and  Max(Pi,  P2)  >>  Min(Pi,  P2)  in  a  large  parallel 
processing  system.  Further,  we  achieve  the  following  conclu¬ 
sion  that  the  priority  should  be  given  to  jobs  working  on  the 
low-concurrency  stage  to  achieve  a  better  delay  performance. 
This  priority  assignment  scheme  remains  true  for  3-stage  job 
models  with  a  high-concurrency  stage  preceded  and  followed 
by  low-concurrency  stages.  By  dropping  the  Max(Pi,  P2)  >> 
Min(Px,  P2)  condition,  we  also  obtain  the  mean  system  delays 
for  cases  when  the  priority  is  given  to  jobs  working  on 
stage  2.  A  Scale-up  Rule  is  further  proposed  which  gives  very 
good  approximation  results  for  systems  when  the  number  of 
processors  in  the  system  is  greater  than  Max(Pi,  P2).  Finally, 
an  approximation  model  for  the  general  cases  with  N  (N  >  2) 
stages  is  included  which  shows  reasonably  good  results. 
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Processor 
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Fig.  15.  (a)  An  example  with  [Pi,P2,ft, P4,/jti,/Z2, =  [3,9, 
3, 1, 1/2, 1/2, 1/2, 1/2].  (b)  The  first  step  toward  approximation  model  for 
(a),  (c)  The  second  step  toward  approximation  model  for  Fig.  15(a). 


Appendix  A 


A.  Proof  of  Theorem  1 

Proof:  For  the  HL  job  model,  we  can  regard  stage  1  as 
an  M/M/ 1  queue.  Hence,  we  have  the  Laplace  transform  of 
the  system  time  for  stage  1,  denoted  as  1T(5)>  given  in  [9, 
p.  195] 


y:(s) 


M l(l  ~  Pi) 
s  +  Mi(l  ~  Pi) 


(Al) 


LOAD 

Fig.  16.  An  approximation  result  for  Fig.  15(a)  and  P  —  12. 


LOAD 

Fig.  17.  An  approximation  result  for  Fig.  15(a)  and  P  ~  45. 


equals  the  probability  that  there  are  k  jobs  in  the  system  in 
equilibrium.  Hence, 

Pr[a  departure  from  stage  1  finds  k  jobs  still  in  stage  1] 

=  (1  -PM-  (A2) 


For  stage  2,  the  system  time  can  be  found  by  using  the  delay 
cycle  analysis  given  in  [10,  p.  110].  When  stage  1  is  idle,  jobs 
in  stage  2  have  an  exponential  service  time  distribution  with 
parameter  /i2.  However,  when  stage  1  becomes  busy,  jobs  in 
stage  2  will  all  be  paused  until  stage  1  becomes  idle  again  and 
resume  the  work.  Clearly,  the  system  time  for  stage  2  can  be 
modeled  as  a  delay  cycle.  The  distribution  of  the  initial  delay 
cycle  can  be  found  in  the  following. 

It  is  shown  in  [9]  in  page  176  that  in  an  M/G/l  system, 
the  probability  that  a  departure  finds  k  jobs  in  the  system 


A  departure  from  stage  1  to  begin  receiving  service  in 
stage  2  finds  k  jobs  still  in  stage  1  has  an  initial  delay  cycle 
whose  Laplace  transformed  distribution,  denoted  as  Gq^(s), 
can  be  represented  as 


G 


(k) 

0 


00  = 


M2 

s  +  M  2  ’ 


(A3) 


From  the  delay  cycle  analysis  formula  [10],  we  have  the 
Laplace  transform  of  the  system  time  for  stage  2,  denoted  as 
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Fig.  18.  Markov  chain  for  LH  job  model. 


Y2  0s),  given  as 


Y2*(s)  =£*(!-  pM  •  Gik\s  +  A  -  A G*(s)) 


=  52-(!  -Pi)Pi 


i  [s  +  A  —  XG*(s)  +  hi 


s  +  A  —  XG*(s)  +  H2 

=  M2(l  -  Pi)  •  s  +  A_AG.(s)  +  M2 

S  +  A  —  AG*(s)  +  Ml 
s  —  AG*  (5)  +  mi 

_ _ M1P2O-  -  pi) _ 

G*(s)[s  +  A  -  AG*(s)  +  /z2][s  -  AG*(s)  +  /xi] 

where  p\  equals  A/mi  and  G*(s)  is  the  Laplace  transform  of 
the  busy  period  distribution  of  stage  1. 

From  (Al),  (A2),  and  (A3),  the  Laplace  Transform,  Y£L (s), 
of  the  system  time  of  the  HL  job  model  equals  Y*(s)  ■  y2*(s) 
as  shown  in  (1).  Q.E.D. 

B.  Proof  of  (4) 

Proof:  By  defining  state  (ni,n2)  as  n\  jobs  in  stage  1  and 
n2  jobs  in  stage  2,  we  have  the  Markov  chain  given  in  Fig.  18 
and  the  following  equilibrium  balance  equations: 

(A  +  kpi)p(k,0)  =  A p(k  -—1,0)  +  1)  k  >  1 


(A  +  M2 )p(k  -1,1)  =  kp,ip{k,  0)  +  A p(k  -2,1)  k  >  2 

(B2) 

Ap(0, 0)  =  p2p{  0, 1)  (B3) 

(A  +  /x2)p(0,  l)  =  PiP(l,0).  (B4) 

We  define  p(k)  to  be  the  probability  that  there  are  totally  k 
jobs  in  the  system  and  define  the  following  notation: 

oo  oo 

P(z)  =  '$2p(k)zk,  P0{z)  =  Y^P(k,0)zk, 


pi(z)  =  X+(M)* 


Since  p(k)  =  p(/c,  0)  +p(k  —  1, 1),  it  can  easily  be  shown  that 
P(z)  =  P0(z)  +  zPi(z).  From  (Bl)  and  (B3)  we  have 

\Po(z)  +  m z  f  P0(z)  =  A zP0(z)  +  H2P\{z)-  (B5) 

az 

From  (B2)  and  (B4)  we  have 

(A  +  H2)Pi{z)  =  hi  fz  Po(z)  +  A zPi(z).  (B6) 

From  (B5)  and  (B6)  we  have  the  following  differential  equa¬ 
tion  for  P0(z). 

d 

(\z  -  H2)px  -J-  Po(z)  +  A(A  +  H2-  Xz)Po(z)  =  0. 
az 

Solving  this  linear  differential  equation  we  obtain  the  follow¬ 
ing  explicit  expression  of  Po{z): 

Po(z)  =  C  ■  exp((\/ hi)(z  -  ln|A^  -  /u2|))  (B7) 

where  c  is  a  constant  yet  to  be  determined.  From  (B5)  and 
(B6)  we  have 

Pl(z)  = -^-r-Poiz).  (B8) 

M2  —  AZ 

Combining  (B7),  (B8),  and  P(l)  =  1  we  find  c  to  be 

c=  — — —  exp((— A//xi)(l  —  ln(/x2  —  A))). 

P2 

From  the  above  results  we  have  the  ^-transform  of  the  number 
of  jobs  in  the  system  as  shown  in  (4).  Q.E.D. 

C.  Proof  of  (9)  to  (13) 

Proof:  From  the  Markov  chain  given  in  Fig.  6,  we  have 

(A  +  kfJL\)p{k,  0)  =  A p(k  -1,0)  +  p{k,  1) 

1  <  k  <  m  (Cl) 

(A  +  mpi)p(k ,  0)  =  A p(k  -1,0)  +  M2 p(k,  1) 

k>  m  (C2) 

(A  +  )p(k  -  1, 1)  =  A p(k  -  2, 1)  +  fc/xip(/c,0) 

2  <  k  <  m  (C3) 

(A  +  (12 )p(k  -1,1)=  A p(k  -  2, 1)  +  m(iip(k ,  0) 

k  >  m  (C4) 

Ap(0,0)  =  M2  KM)  (C5) 

(A  +  M2)p(0,l)  =  Mip(l,0).  (C6) 
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We  define  p(k)  =  Pr[k  jobs  in  the  system]  =  p(fc,  0)  + 
p(k  -  1, 1).  We  further  define 

oc  oo 

p{,z)=^v{k)zk,  Po(z)  =  Y^P(k’°)zk’ 

k= 0  k= 0 

oo 

A  (*)-£>(&,  l)zk. 

k—0 

Hence,  we  have  P(z)  =  Po(z)  +  zPi(z).  From  (Cl)  and  (C2) 
we  have 

m  —  1 

(A  +  mp  1  -  Xz)Po(z)  =  p2P\(z)  +  /ii 

fc=0 

•  (m  —  fc)p(fc,0)zfc.  (C7) 
Similarly,  from  (C3)  and  (C4)  we  have 

m— 1 

[(A  +  H2 )z  -  \z2]Pi(z)  =  m^Poiz)  -  hi  ^ 

fc=0 

♦  (m  -  fc)p(fc,  0)zfe.  (C8) 

From  (C7)  and  (C8)  we  have 

PM  =  -\-Po{z).  (C9) 

P>2  ~  AZ 

From  (C7),  (C8),  and  (C9)  we  have  P(z)  as  shown  in  (9). 
The  remaining  job  is  to  find  all  p(k,  0)  for  0  <  k  <  m  —  1. 
Using  (9)  and  P(  1)  =  P0(l)  +  ^i(l)  =  1,  we  can  obtain  (10). 
Equations  (11),  (12),  and  (13)  can  easily  be  obtained  from 
arranging  (Cl),  (C3),  (C5),  and  (C6).  From  (10)  through  (13) 
we  are  able  to  find  all  p(fc,  0)  for  0  <  k  <  m  -  1.  Q.E.D. 
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Abstract — Distributed  systems  frequently  have  large  numbers 
of  idle  computers  and  workstations.  If  we  could  make  use  of 
these,  then  considerable  computing  power  could  be  harnessed  at 
low  cost.  We  analyze  such  systems  using  Brownian  motion  with 
drift  to  model  the  execution  of  a  program  distributed  over  the  idle 
computers  in  a  network  of  idle  and  busy  processors,  determining 
how  the  use  of  these  “transient”  processors  affects  a  program’s 
execution  time.  We  find  the  probability  density  of  a  programs 
finishing  time  on  both  single  and  multiple  transient  processors, 
explore  these  results  for  qualitative  insight,  and  suggest  some 
approximations  for  the  finishing  time  probability  density  that 
may  be  useful. 

Index  Terms —  Brownian  motion,  distributed  processing,  idle 
processors,  performance  analysis,  transient  processors. 


I.  Introduction 

DISTRIBUTED  systems  frequently  have  large  numbers  of 
idle  computers  and  workstations.  If  we  could  use  these, 
then  considerable  computing  power  could  be  harnessed  at  low 
cost.  In  this  paper,  we  model  program  execution  on  a  network 
of  workstations,  some  idle  and  some  not.  Because  we  use 
only  the  idle  time  on  the  processors,  they  are  not  always 
available  for  use.  Hence,  we  call  these  machines  “transient” 
processors.  Using  a  direct  analysis  and  a  cumulative  alternating 
renewal  process  analysis  both  provide  simple  expressions 
for  the  probability  density  of  the  program  finishing  time 
on  a  single  transient  processor.  We  extend  the  cumulative 
alternating  renewal  process  into  a  Brownian  motion  with  drift 
model  of  the  finishing  time  density  on  multiple  processors.  We 
then  examine  the  properties  of  the  finishing  time  probability 
density  to  achieve  some  qualitative  insight  into  the  results. 
Finally,  we  suggest  several  approximations  for  the  finishing 
time  probability  density,  and  discuss  under  what  conditions 
they  can  be  used. 

A.  Background 

Networks  of  computers  are  common  in  business  and 
research  environments  throughout  the  world.  Local  area 
networks,  which  were  originally  introduced  to  ease  data  and 
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device  sharing,  have  grown  in  speed,  sophistication,  and  size  to 
the  point  that  effective  distributed  processing  can  be  performed 
on  them.  These  networks  vary  in  size  from  a  handful  of 
personal  computers  on  a  low-speed  network,  to  networks 
consisting  of  thousands  of  workstations  and  a  variety  of  larger 
machines  on  a  high-speed,  fiber-optic  network.  As  a  typical 
example,  consider  a  network  of  workstations  on  a  high-speed 
network  in  a  research  laboratory.  Not  only  are  there  many 
machines,  well  connected  by  the  network,  but  the  users  are 
likely  to  demand  more  and  more  computing  power  as  the 
applications  grow. 

Networks  of  workstations  have  grown  in  spite  of  theoretical 
considerations  that  would  discourage  them.  It  is  well  known  in 
queueing  theory  [2]  that  a  single  server  of  large  capacity  shared 
by  many  users  provides  better  response  time  than  many  smaller 
servers  with  the  same  total  capacity.  Thus  we  might  expect 
that  a  mainframe  will  provide  faster  service  to  its  users  than  a 
network  of  workstations.  Of  course,  one  may  argue  that  for  the 
same  amount  of  money  one  can  buy  much  more  workstation 
capacity  than  mainframe  capacity,  but  even  then  we  would  like 
to  make  the  best  use  possible  of  our  computing  resources.  This 
implies  that  a  network  of  workstations  would  have  improved 
performance  (overall)  if  the  workstations  were  considered 
part  of  one  processor  pool,  available  for  the  execution  of  all 
programs.  Some  systems,  such  as  Amoeba  [3]  provide  a  pool 
of  processors  . strictly  as  compute  servers.  We,  however,  would 
like  to  retain  the  usual  use  of  workstations  on  the  network  in 
addition  to  considering  them  part  of  the  processor  pool. 

The  solution  to  this  is  idle  time.  On  these  networks,  we 
often  have  the  situation  that  many  of  the  personal  computers 
and  workstations  are  sitting  idle,  waiting  for  their  users,  and 
thus  being  wasted  ([4],  [5]).  If  we  could  recover  this  wasted 
time  for  useful  processing,  then  we  would  have  considerable 
computing  power  available  to  us  at  low  cost.  We  refer  to  these 
processors,  which  are  sometimes  busy  and  sometimes  not,  as 
transient  processors. 

Whether  this  is  technically  feasible  or  not  depends  on  a 
variety  of  factors,  such  as  the  properties  of  the  communications 
medium,  the  properties  of  the  computers,  and  the  statistical 
characteristics  of  the  user  population.1  In  all  systems  of  this 
type,  one  concern  is  that  the  “owner”  of  a  machine  should  not 
see  any  degradation  in  performance  because  of  the  background 
programs.  Any  background  computation  should  be  aborted 

1  There  are  also  other  important  but  nontechnological  factors,  such  as 
people’s  resistance  to  the  use  of  “their”  machine,  that  would  determine  if 
and  how  a  distributed  system  would  be  implemented.  This  paper  does  not 
examine  such  matters. 
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when  user  activity  is  detected,  and  not  restarted  until  the 
system  is  sure  that  the  machine  is  idle. 

There  has  been  much  research  on  systems  that  make  use 
of  idle  workstations  through  load  balancing  and  process  mi¬ 
gration;  we  mention  here  a  few  such  systems.  Most  of  these 
provide  remote  program  execution  facilities  (e.g.,  run  a  com¬ 
piler  at  another  machine  rather  than  on  the  local  workstation), 
rather  than  specifically  supporting  distributed  computations. 
We  make  note  of  those  systems  that  have  been  designed 
specifically  for  distributed  computations. 

Alonso  and  Cova  [6]  discuss  a  load  balancing  system  for 
workstations  in  which  a  workstation  tries  to  execute  a  job 
remotely  if  the  local  processor  load  is  greater  than  some  “High 
Mark,”  and  accepts  jobs  from  other  workstations  if  the  local 
processor  load  is  less  than  some  “Low  Mark.”  This  allows  for 
a  continuum  of  migration  policies. 

At  U.C.L.A.,  the  Benevolent  Bandit  Laboratory  (BBL)  [7] 
runs  distributed  computations  under  MS-DOS  on  a  network 
of  IBM  PC-AT’s.  A  special  shell  runs  on  each  machine, 
and  when  a  machine  is  at  the  operating  system  prompt 
level  (as  opposed  to  running  a  program),  it  is  available  for 
use.  If  someone  starts  to  use  a  machine  currently  part  of  a 
background,  distributed  computation,  the  system  can  select  and 
start  a  replacement  machine  from  the  pool  of  idle  processors. 
Because  the  system  was  also  intended  for  the  investigation 
of  distributed  algorithms,  special  features,  such  as  the  ability 
to  mimic  any  connection  topology,  and  some  distributed 
debugging  facilities,  have  been  built  in. 

Lyle  and  Lu  [8]  describe  a  simple  remote  program  execution 
facility  that  operates  at  the  shell  level,  like  BBL,  rather  than 
the  kernel  level.  This  has  the  advantage  of  being  simple  to 
implement,  yet  still  uses  idle  workstations. 

Condor  ([9],  [10],  [4])  is  a  very  successful  remote  program 
execution  facility  running  on  workstations  at  the  University  of 
Wisconsin.  The  developers  of  the  system  have  made  a  number 
of  useful  measurements  of  workstation  behavior  in  [10]. 

The  Butler  system  [11],  running  on  Andrew  workstations 
at  Carnegie-Mellon  also  provides  remote  program  execution 
facilities.  The  system  uses  this  to  run  gypsy  servers,  which  are 
network  servers  that  run  on  idle  workstations  instead  of  on  a 
fixed  machine. 

Stumm  [12]  discusses  a  remote  program  execution  and  task 
migration  facility  for  the  V  kernel.  His  paper  discusses  various 
issues,  such  as  the  migration  policy,  and  offers  thoughts  on 
using  the  system  for  distributed  computations. 

The  Worm  program  [13]  was  developed  at  Xerox  PARC 
as  an  experiment  in  distributed  processing.  Worms  prowled 
the  network,  collecting  idle  workstations  and  using  them 
to  perform  some  action,  typically  displaying  a  message  or 
running  a  diagnostic  program. 

There  have  also  been  ad-hoc  attempts  to  use  the  idle  time 
on  processors.  Dr.  Tim  Shimeall  [14],  during  his  dissertation 
research,  wrote  a  program  “polite”  that  ran  a  software  analysis 
program  on  workstations  when  no  one  was  logged  in  and 
suspended  the  program  when  the  workstation  was  being  used. 
He  reports  that  he  finished  nearly  10  CPU  years  of  work 
in  about  6  months  on  20  workstations  using  this  program. 
But  again,  this  was  very  much  a  simple  remote  job  exe¬ 


cution  facility,  put  together  out  of  need,  and  it  was  never 
analyzed. 

B.  Outline 

Section  II  discusses  our  model  of  the  network.  Section  III 
gives  a  simple  analysis  of  the  average  time  for  a  program  to 
finish.  Sections  IV  and  V  then  develop  three  models  of  the 
network  and  analyze  them  to  find  the  distribution  of  time  to 
finish  a  fixed  amount  of  work. 

The  first  model,  in  Section  IV-B,  is  a  single  processor  model 
with  general  available  and  nonavailable  times.  We  examine 
the  number  of  nonavailable  periods  interrupting  a  program, 
and  from  this  we  find  the  Laplace  transform  of  the  distribution 
of  the  time  to  finish  a  program  (response  time),  and  then  the 
mean  and  variance  of  the  response  time. 

The  second  model,  in  Section  IV-C,  is  also  a  single  pro¬ 
cessor  model,  but  it  is  analyzed  as  a  cumulative,  alternating 
renewal  process.  We  find  that  the  asymptotic  distribution  of  the 
accumulated  work  (over  a  long  period  of  time)  is  Gaussian, 
with  simple  expressions  for  the  mean  and  variance. 

The  third  model,  in  Section  V-A,  handles  multiple  pro¬ 
cessors  and  views  the  amount  of  work  done  over  time  as 
Brownian  motion  with  drift.  We  scale  the  asymptotic  mean 
and  variance  of  the  accumulated  work  from  the  second,  single 
processor  model  to  the  case  of  M  processors,  and  use  this 
as  the  mean  and  variance  of  the  Brownian  motion  with  drift. 
From  this  we  get  the  probability  density  of  the  time  to  finish 
a  fixed  amount  of  work  on  M  processors.  The  mean  and  the 
variance  agree  very  closely,  for  M  =  1,  with  the  first  model. 

Finally,  Section  VI  contains  the  conclusions.  These  three 
models  offer  an  approach  to  predicting  performance  of  dis¬ 
tributed  programs  on  transient  processors.  By  relaxing  some  of 
our  assumptions,  more  sophisticated  models  could  be  derived 
from  those  described  here. 

II.  The  Model  of  the  Network  and  the  Workload 

A.  The  Network 

Assume  that  we  have  a  network  of  M  identical  processors, 
each  of  which  has  a  capacity  to  complete  one  minute  of  work 
per  minute.  A  processor  alternates  between  a  nonavailable 
state  (signified  by  n  or  na),  when  the  owner  is  using  it  (e.g., 
typing  at  the  keyboard),  and  an  available  state  (signified  by 
a  or  av),  when  it  is  sitting  idle.  The  lengths  of  nonavailable 
periods  are  independent  and  identically  distributed  (i.i.d.)  ran¬ 
dom  variables  from  distribution  N(t),  with  mean  tn,  variance 
and  corresponding  density  n(t);  we  allow  any  general 
distribution  for  N(t),  unless  otherwise  specified.  Likewise, 
available  periods  are  i.i.d.  random  variables  from  a  general 
distribution  A(t)  with  mean  ta,  variance  o\,  and  density 
a(t).  The  available  and  nonavailable  periods  are  mutually 
independent. 

B.  The  Distributed  Program  Workload 

We  model  a  program  as  consisting  of  multiple  stages  of 
work,  each  of  which  must  be  completed  before  the  start  of 
the  next,  and  each  of  which  represents  a  deterministic  amount 
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Fig.  1.  Execution  time  profile  of  an  algorithm. 


of  work  (Fig.  1).  The  time  to  finish  a  program  is  the  sum 
of  the  times  to  complete  the  individual  stages.  We  assume 
that  the  time  to  finish  a  stage  depends  only  on  the  amount 
of  work  in  that  stage,  and  is  independent  of  the  other  stages. 
This  means  that  the  probability  distribution  of  the  total  time 
to  finish  a  program  is  the  convolution  of  the  distributions 
of  the  individual  stage  finishing  times.  Assuming  that  the 
network  characteristics  do  not  change  during  the  execution 
of  the  program,  then  from  the  analysis  of  the  time  to  finish 
a  single  stage  requiring  W  minutes  of  work  we  can  find  the 
finishing  time  probability  density  function  of  all  other  stages, 
and  from  there,  the  finishing  time  probability  density  function 
of  the  program  as  a  whole. 

We  make  the  simplifying  assumption  that  the  work  in 
any  stage  is  infinitely  divisible — it  can  always  be  divided 
evenly  among  all  available  processors.  Note,  however,  that  this 
assumption  does  not  always  obscure  program  behavior.  Some 
programs,  such  as  the  simulators  used  for  the  models  of  this 
paper,  are,  in  fact,  composed  of  very  many  independent  tasks, 
and  work  is  always  available  for  any  idle  processor.  In  such 
cases,  this  assumption  captures  the  program  behavior  and  is 
not  a  simplification  at  all.  In  addition,  in  a  system  with  multiple 
users  on  many  machines,  the  aggregation  of  independent  jobs, 
each  with  a  number  of  tasks,  from  many  users  will  yield  an 
overall  workload  that  tends  to  look  like  many  independent 
tasks  (or,  at  least,  enough  tasks  to  keep  idle  machines  busy), 
and  this  would  fit  well  with  our  assumption.  We  will  explore 
this  more  in  future  work. 

Another  simplifying  assumption  we  make  is  to  ignore 
overhead  that  occurs  in  a  real  system  (e.g.,  communication 
delays,  processing  delays),  and  thus  our  model  provides  an 
optimistic  bound  on  system  performance.  Some  techniques  for 
removing  this  assumption  are  mentioned  in  [1]. 

C.  What  We  Seek 

The  purpose  of  this  paper  is  to  find  /(f),  the  probability 
density  function  (pdf)  of  a  program’s  finishing  time  (i.e., 
response  time).  Also  of  interest  are  its  mean,  /,  and  its 
variance,  cr'j. 


In  the  process  of  finding  these,  we  will  need  another 
function,  namely,  the  pdf  of  the  amount  of  work  accumulated 
(i.e.,  completed)  by  a  processor  or  network  of  processors 
over  time.  We  denote  this  by  y(u  \  t ),  the  probability  density 
that  after  t  minutes  of  time  have  elapsed,  the  processor  (or 
processors)  under  consideration  has  accumulated  u  minutes 
of  work.  This  function  has  mean  E [y{u  |  t)]  and  variance 
Var[j/(u  |  t)}. 

D.  Notation 

We  use  the  abbreviations  “PDF”  to  stand  for  “probability 
distribution  function.”  Typically  we  use  capital  letters  for  a 
PDF  and  the  corresponding  lower  case  letters  for  a  pdf.  If 
F(x)  is  a  PDF,  then  its  pdf  is  f(x)  =  (d/dx)F(x). 

E.  Example  Parameters 

Mutka  and  Livny,  [10],  made  actual  measurements  of  a 
network  of  transient  processors,  and  they  developed  models 
for  the  available  and  nonavailable  period  densities  to  fit  these 
measurements.  From  their  results,  we  derive  two  examples 
that  we  use  throughout  this  paper. 

The  model  they  used  for  the  available  time  PDF  was  a 
3-stage  hyperexponential  distribution: 

A(t)  =  P  [length  of  an  available  period  <  t] 

=  0.33(1  -  e_(t/3))  +  0.4(1  -  e~(t/25)) 

+  0.27(1  -  e_(t/300))  t>  0  (1) 

which  has  mean  ta  =  91  min  and  variance  u2  =  40225  min2. 

For  the  nonavailable  time  distribution,  N(t)  =  P[length 
of  a  nonavailable  period  <  t],  they  used  a  shifted  2-stage 
hyperexponential  distribution: 

M,  .  _  /  0.7(1  -  e-WV)  +  0.3(1  -  e-(*/55))  if  t  >  7 

W  \0  if  0  <  f  <  7 

(2) 

which  has  mean  tn  =  31.305  min  and  variance  er^  = 
2131.83  min2.  The  7  min  shift  in  the  distribution  arises 
because  a  processor  was  not  declared  idle  until  7  idle  minutes 
had  elapsed. 

We  use  Mutka  and  Livny ’s  distributions  wherever  possible 
in  our  examples,  but  frequently  we  assume  exponentially 
distributed  available  and  nonavailable  periods;  at  such  times, 
we  take  the  means  of  these  exponential  distributions  to  be 
the  numbers  given  above.  The  use  of  exponential  distributions 
instead  of  hyperexponential  distributions  will  not  affect  any 
means  that  we  derive,  but  any  variances  that  we  find  will  be 
lower  than  if  we  had  used  Mutka  and  Livny ’s  distributions. 

Regardless  of  which  distributions  we  use,  we  take  W  = 
1000  min  and  M  —  1  for  single  processor  examples,  and, 
W  =  10000  min  (almost  7  days)  and  M  =  100  for  most 
multiple  processor  examples.  The  reason  for  the  large  values 
of  W  is  explained  below. 

F.  Related  Work 

One  approach  to  analyzing  a  single  processor  system  is  to 
use  queueing  with  vacation  as  a  model.  In  such  a  system,  the 
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queueing  server  is  subject  to  randomly  occurring  stoppages 
lasting  for  random  amounts  of  time.  There  are  many  varieties 
of  such  systems  depending  upon  what  restrictions  we  put  on 
the  vacations  (see  [15]  for  a  survey).  For  our  model,  we  require 
vacations  to  occur  preemptively  and  at  any  time  (as  opposed 
to  vacations  that  occur  only  when  the  processor  is  busy).  The 
earliest  analysis  of  such  systems  is  in  [16],  and  later  in  [17], 
but  Gaver  ([18])  derives  the  Laplace  transform  for  the  finishing 
time  by  assuming  exponentially  distributed  available  periods 
and  generally  distributed  nonavailable  periods.  Federgruen 
and  Green  ([19])  extend  the  analysis  to  generally  distributed 
available  periods,  but  they  find  only  the  first  two  moments  of 
the  finishing  time,  and  not  its  distribution. 

For  both  single  and  multiple  processor  systems,  performabil- 
ity  analysis  offers  an  alternative  approach  to  ours.  Performabil- 
ity  analysis  ([20],  [21],  [22])  combines  dependability  analysis 
with  performance  measures.  A  system  is  modeled  as  a  Markov 
or  semi-Markov  process  in  which  each  state  of  the  process 
represents  a  possible  configuration  of  the  system  with  respect 
to  failed  and  working  components.  In  a  multiprocessor,  for 
example,  the  state  could  be  the  number  of  working  processors. 
This  state  represents  the  reliability  aspect  of  performability 
analysis.  Associated  with  each  state  is  a  reward  representing 
either  the  performance  measure  of  interest,  or  a  quantity  that 
may  be  used  to  calculate  the  performance  measure.  Applied  to 
the  models  of  this  paper,  the  state  of  the  system  represents  the 
number  of  available  processors,  and  the  reward  for  each  state 
is  the  amount  of  available  computing  power  (in  operations 
per  time  unit)  in  that  state.  Our  goal  would  be  to  find 
the  distribution  of  time  it  takes  for  the  accumulated  reward 
(accumulated  work)  to  reach  a  threshold  representing  the 
amount  of  work  a  program  requires. 

A  number  of  researchers  have  examined  the  problem  of 
finding  the  distribution  of  accumulated  reward  (also  known 
as  the  performability  distribution).  Nonrepairable  systems,  in 
which  a  nonavailable  processor  cannot  become  available  (“be 
repaired”),  are  easiest  to  analyze,  but  are  not  applicable  to  our 
problem.  For  repairable  systems,  some  researchers  have  found 
methods  to  get  the  moments  of  the  performability  distribution 
([23],  [24],  [25]),  and  other  researchers  have  expressed  the 
performability  distribution  as  a  double  Laplace  transform 
([24],  [26],  [23],  [27],  [28],  [29]).  In  the  latter,  typically 
the  transform  can  be  inverted  analytically  on  one  variable, 
then  inverted  numerically  on  the  other  variable,  although 
[24]  perform  the  inversion  entirely  numerically.  In  [30],  de 
Souza  e  Silva  and  Gail  apply  randomization  techniques  to 
numerically  find  the  distribution  of  performability  over  a  finite 
time  interval. 

However,  we  do  not  want  this  distribution  of  accumulated 
reward  itself,  but  we  would  like  to  find  the  distribution  of  time 
for  the  accumulated  reward  to  reach  a  threshold.  Because  our 
reward  represents  accumulated  work,  this  latter  distribution 
is  equivalent  to  the  distribution  of  time  to  complete  a  job. 
In  one  of  the  early  papers  on  performability,  Beaudry  [20] 
defines  this  quantity,  but  never  derives  it  for  systems  of 
interest  to  us.  Kulkami,  Nicola,  Smith,  and  Trivedi  [27]  find 
a  double  transform  of  the  job  completion  time  distribution  in 
terms  of  a  system  of  equations,  and  provide  an  algorithm  for 


numerical  inversion  of  the  transform.  The  difference  between 
this  previous  work  and  the  work  contained  in  this  paper  is  that 
our  model  takes  a  different  view  of  the  problem  and  involves 
simple,  approximate  analytical  expressions,  with  no  numerical 
techniques  being  necessary. 

Finally,  much  of  the  work  on  performability  concerns  itself 
with  transient  analysis,  because  in  a  well-designed,  fault- 
tolerant  system,  faults  will  be  quite  infrequent,  and  steady 
state  analysis  can  be  misleading.  Our  situation  is  the  opposite, 
with  “faults”  (processor  nonavailability)  occurring  frequently, 
and  we  expect  many  “faults”  to  occur  before  a  program 
finishes  execution.  Iyer  et  al  [26]  note  that  asymptotically, 
after  a  long  enough  time  that  every  state  of  the  system  has 
been  entered  many  times,  the  performability  distribution  is 
normally  distributed,  and  the  mean  and  standard  deviation 
of  this  distribution  can  be  found  by  solving  sets  of  linear 
equations.  In  this  paper,  we  come  to  the  same  conclusion  about 
the  normality  of  the  asymptotic  distribution  by  starting  from 
the  analysis  of  a  single  processor,  as  discussed  in  Sections  IV- 
C  and  V-A,  and  in  doing  so  we  find  simple  expressions  for 
the  mean  and  variance  of  this  distribution  [(18)  and  (19)]. 

One  appealing  aspect  of  performability  models  is  that  by 
setting  the  rewards  appropriately  for  each  system  state,  the 
model  could  capture  some  of  the  inefficiencies  that  occur  as  a 
program  executes  on  varying  numbers  of  processors,  examples 
of  which  are  the  additional  communication  overhead  involved, 
or  the  program’s  inability  to  use  all  available  processors.  These 
rewards,  however,  would  be  specific  to  that  particular  program. 
Ammar  and  Islam  [24]  have  done  this  using  Generalized 
Stochastic  Petri  Nets  to  generate  the  reliability  model  for  a 
specific  architecture,  and  then  trace-driven  simulations  of  a 
specific  algorithm  to  determine  the  reward  for  every  state 
of  the  reliability  model.  The  reward  is  the  inverse  of  the 
total  execution  time  of  the  computation,  given  the  system 
is  in  a  particular  state.  Because  there  may  be  many  states 
in  the  reliability  model,  and  hence  many  simulation  runs 
required,  their  model  is  potentially  quite  time  consuming. 
We  are  investigating  methods  by  which  we  can  capture  the 
interaction  of  the  algorithm  with  the  architecture  within  our 
Brownian  motion  model. 

III.  Time  to  Finish  a  Program:  Quick  Means 

With  simple  reasoning,  we  can  find  the  mean  cumulative 
work  over  time,  E [y(u  1 1)]9  and  the  mean  finishing  time  for  a 
program,  /.  Over  a  long  period  of  time,  a  processor  is  available 
a  fraction  of  the  time  pa  =  ta/(ta  +  tn),  and  nonavailable  the 
remaining  fraction  of  the  time,  pn  =  tn/(ta  +  tn).  Over  a 
period  of  t  seconds,  the  amount  of  work  a  processor  does  is 
equal  to  the  fraction  of  time  it  is  available  (assuming  that  there 
is  a  large  amount  of  work  to  do,  and  that  the  processor  never 
goes  idle),  and  thus  we  have  the  equilibrium  approximation 

Efo(«  I  t)}  =  -±-  t.  (3) 

&a  i  In 

Similarly,  it  takes  (ta  +  tn)/ta  seconds  to  accumulate  one 
second  of  work,  so  the  average  finishing  time  for  a  program 
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-  on  a  single  processor  is 


f  —  t  *  tn  ly 

ta 


(4) 


In  an  M  processor  network,  we  accumulate  work  M  times 
faster  and  thus  finish  in  (1/M)th  of  the  time.  Therefore: 

taM  ; 


E [y(u  1 1)}  = 


(ta  "I"  tn ) 


7  =  h  tin  w 

1  taM 

We  will  use  these  as  a  check  on  the  other  analyses. 


(5) 

(6) 


Fig.  2.  Time  for  one  node  to  finish  W  units  of  work. 


IV.  The  Distribution  of 
Finishing  Time  for  One  Processor 

A.  Introduction 

We  would  now  like  to  find  the  pdf  of  the  finishing  time  for 
any  particular  algorithm  or  program.  This  is  also  known  as 
the  first  passage  time ,  the  first  time  at  which  the  accumulated 
work  is  greater  than  some  particular  amount  (W  minutes 
in  our  case).  In  this  section,  we  analyze  the  behavior  of  a 
program  on  a  single,  transient  processor  using  two  methods. 
The  direct  method,  in  Section  IV-B,  yields  f(t)  for  general 
distributions,  but  unfortunately,  it  does  not  extend  to  multiple 
processors  because  the  analysis  depends  upon  the  system  being 
either  fully  available  or  fully  nonavailable.  In  a  multipro¬ 
cessor  system,  we  usually  have  partial  availability:  some  of 
the  machines  are  available  and  some  are  not.  We  do  not 
derive  the  pdf  of  accumulated  work,  y(u  \  t),  using  the  direct 
analysis.  However,  by  analyzing  the  problem  as  a  cumulative, 
alternating,  renewal  process  (Section  IV-C),  we  do  find  the 
asymptotic  probability  density  of  the  accumulated  work  as 
t  — *■  oo,  and  we  use  this  in  the  Brownian  motion  analysis 
of  the  next  section. 


arrive  in  a  W  minute  period  starting  at  the  beginning  of  an 
available  period,  and  N *(s)  is  the  Laplace  transform  of  the 
length  of  a  nonavailable  period.  Note  that  we  can  also  get  the 
finishing  time  density  directly  using  the  same  technique,  but 
this  usually  results  in  an  open  expression;  details  are  available 
in  [1].  From  the  Laplace  transform,  we  can  derive  the  finishing 
time’s  mean: 

1  =  wk±h.t  (8) 

t'a 

and  variance: 

~  anP  +  (9) 

where  p  and  o %  are  the  mean  and  variance  of  p(k  |  W). 

The  central  limit  theorem  assures  us  that  when  we  sum 
many  independent  random  variables,  the  resulting  distribution 
tends  toward  a  normal  distribution.  We  may  note  an  important 
consequence  of  this:  asymptotically,  for  large  W  compared  to 
ta  and  tn,  the  finishing  time  density  is  normal  with  mean  and 
variance  given  in  (8)  and  (9). 

1 )  Example:  Exponential  Distributions:  If  available  and 
nonavailable  periods  are  exponentially  distributed,  then  the 
finishing  time  density  is 


B.  Direct  Analysis  of  a  Single  Processor 

We  make  a  direct  analysis  of  the  single-processor  problem 
by  counting  the  number  of  nonavailable  periods  that  interrupt 
our  program  before  it  completes.  If  our  program  starts  at  the 
beginning  of  an  available  period,  as  shown  in  the  middle  of 
Fig.  2,  it  will  finish  at  time  W +Ta,  where  Ta  is  the  additional 
time  the  program  spends  in  the  system  because  of  interrupting 
nonavailable  periods. 

Because  we  must  finish  the  program  in  an  available  period, 
and  because  we  assume  work  starts  at  the  beginning  of  an 
available  period,  then  none  of  the  nonavailable  periods  are 
truncated,  and  it  is  relatively  easy  to  analyze  the  total  length 
of  the  nonavailable  periods  during  the  time  Ta  +  W.  By 
examining  the  arrival  process  for  nonavailable  periods,  we  find 
that  the  Laplace  transform  of  the  finishing  time  density  is 

oo 

F*(S)  =  e-^^[N*(s)]fep(fc|W) 

k- 0 

=  e"WsP(N*(s))  (7) 

where  P(z)  =  p  (k  \  W)zk  is  the  z-transform  of  p(k  \ 

W),  the  probability  that  k  zero-length  nonavailable  periods 


m  = 

(  e-w^  if  t  =  w 


( (1  /tn)qt-w)/tK)h 

;fc=ll  (fc— 1)! 

if  t  >  W 


o-(t- 


(10) 


This  was  derived  using  a  direct  analysis  detailed  in  [1].  Fig. 
3  illustrates  this  density  using  Mutka  and  Livny’s  parameters 
to  select  the  means  ta  =  91  min  and  tn  =  31.305  min  of  the 
exponential  distributions.  The  mean  finishing  time  is 

7  =  wta  +  tn  (ii) 

la 

and  its  variance  is 


2  t2nW 


(12) 


The  finishing  time  density  looks  similar  to  a  nofmal  density, 
but  it  is  asymmetrical.  For  t  less  than  the  mean  first  passage 
time,  it  rises  sharply  to  a  peak  before  the  mean,  then  drops 
into  a  stretched-out  tail  for  large  t.  This  asymmetry  is  more 
apparent  for  small  W,  and  as  W  grows,  the  density  becomes 
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Fig.  3.  Probability  density  of  finishing  time  for  direct  analysis. 


more  similar  to  a  normal  density,  as  one  would  expect  from 
the  central  limit  theorem. 

Unfortunately,  the  analysis  of  this  section  will  not  extend  to 
multiple  processors  because  it  depends  upon  the  system  being 
fully  available  or  fully  nonavailable.  With  multiple  processors 
the  system  is  usually  partially  available.  However,  it  is  possible 
to  get  an  approximation  to  the  finishing  time  density  for  the 
case  of  multiple  processors  by  studying  a  different  process, 
namely  the  accumulated  work.  Thus,  in  the  next  section  we 
use  a  cumulative,  alternating  renewal  process  to  analyze  the 
accumulated  work  on  a  single  processor,  then  in  Section  V-A 
we  apply  this  analysis  to  the  multiple  processor  case. 


Available 
•  Not  Available 


4- 


J - IT- 


Fig.  4.  Cumulative  renewal  process  for  a  processor. 


For  exponentially  distributed  available  and  nonavailable 
periods,  the  mean  remains  the  same  and  the  variance  may 
be  rewritten  as 

2  t2t2 

Var [y(u  |  *)]  *  ^  (15) 

We  could  use  this  approach  to  get  a  (possibly  very  complex) 
expression  for  the  distribution  of  accumulated  available  time; 
this  was  also  done  in  [1]  based  on  the  techniques  of  the 
previous  section.  Of  more  interest  to  us  is  the  fact  that  the 
asymptotic  pdf  of  the  accumulated  available  time  (for  large 
t )  is  normal  with  mean  and  variance  given  by  (13)  and  (14). 
This  distribution  is  based  on  a  sum  of  random  variables  (the 
available  periods),  and  the  Central-Limit  Theorem  [32]  tells  us 
that  as  the  number  of  random  variables  in  the  sum  approaches 
infinity,  this  pdf  converges  to  a  normal  pdf.  Thus  the  pdf  of  the 
accumulated  work,  y(u  \  t),  is  well  approximated  by  a  normal 
pdf  if  many  renewal  periods  have  occurred,  or  equivalently, 
if  t  ta  4-  tn.  This  normal  pdf  will  be  the  basis  of  the 
Brownian  motion  model  in  the  next  section,  which  will  lead 
us  to  an  approximation  for  the  distribution  of  finishing  time 
with  multiple  processors. 

V.  The  Distribution  of  Finishing 
Time  for  M  Processors 


C.  Cumulative,  Alternating  Renewal  Theoretic  Analysis 

Here  we  reexamine  a  single  processor  using  a  cumulative, 
alternating  renewal  process.  Cox,  in  his  book  on  renewal 
processes  [31],  discusses  this  type  of  process,  and  we  make 
use  of  his  analysis. 

We  can  form  a  renewal  process  from  the  alternating  states 
of  a  transient  processor  by  letting  a  renewal  period  be  a 
nonavailable  period  followed  by  an  available  period.  In  Fig. 
4  the  heavy  dots  indicate  the  beginning  of  each  renewal 
period.  The  durations  of  the  available  periods  are  i.i.d.  random 
variables  from  a  general  distribution,  as  are  the  the  durations  of 
the  nonavailable  periods,  and  the  lengths  of  the  available  and 
nonavailable  periods  are  mutually  independent.  Using  Cox’s 
results,  we  find  that  the  distribution  of  accumulated  available 
time  has  mean  and  variance: 

%(« 1 1)\ «  (i3) 

la  ~r  In 

re 2+2  _i_  j.2  _2 

Var [y(u  \  t))  «  (14) 

Cox  derives  these  by  ignoring  the  available  time  accumulated 
in  the  current  available  period  if  one  is  in  progress  at  time  t. 
However,  as  time  goes  to  infinity,  the  asymptotic  distribution 
of  this  approximation  has  the  same  properties  as  the  true 
accumulated  available  time.  Note  that  the  approximation  for 
E [y(u  |  t)]  leads  to  a  mean  which  corresponds  exactly  to  the 
equilbrium  approximation  of  (3)  in  Section  III. 


A.  Brownian  Motion  Approximation 

Brownian  motion  concerns  the  random  movement  of  a 
particle  through  space.  A  stochastic  process,  Q(t ),  that  de¬ 
scribes  Brownian  motion  has  two  basic  properties.  The  first 
is  that  Q(t)  has  independent  increments :  Q(t i)  -  Q(to)  and 
Q(f3)  -  Q(t 2)  are  independent  for  0  <  to  <  t\  <  t<i  <  £3  < 
00.  Movement  of  the  particle  in  one  interval  is  independent 
of  its  movement  in  another  interval.  The  second  property  is 
that  each  increment  in  the  process,  Q(U+i)  -  Q(U)  for  all 
i  >  0,  is  normally  distributed  with  a  mean  and  variance 
proportional  to  U+i  -  U.  If  the  normal  distribution  has  mean 
0  and  variance  equal  to  ti+ 1  —  U ,  then  the  process  describes 
standard  Brownian  motion ,  which  is  also  known  as  a  Wiener 
process .  Brownian  motion  with  a  nonzero  mean  is  known  as 
Brownian  motion  with  drift. 

In  our  model,  we  let  the  stochastic  process  Q(t)  represent 
the  amount  of  work  accumulated  by  a  network  of  M  transient 
processors  up  to  time  t.  In  Section  IV-C,  we  found  that  over  a 
long  period  of  time  (much  longer  than  ta  +  tn),  the  amount  of 
work  done  by  one  transient  processor  is  asymptotically  normal 
with  mean  and  variance  given  in  (13)  and  (14),  respectively. 
If  we  have  a  network  of  M  such  processors,  and  all  the 
processors  are  assumed  to  be  independent  and  identical,  then, 
asymptotically,  the  amount  of  work  done  by  time  t  is  the 
sum  of  M  independent,  (approximately)  normally  distributed 
random  variables,  and  this  is  itself  (approximately)  normally 
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.distributed.  The  mean  amount  of  work  done  by  time  t  is 


ta  +  tn 


Mt  =  paMt 


and  the  variance  of  the  amount  of  work  done  by  time  t  is 

rr2i2  ,  -2  j.2 

J2  _  u  aLn  <  unla  j\/T^  /i  n\ 

°  -  «.+<„)»  UL  <17) 

Thus,  Brownian  motion  with  drift  is  a  natural  model  of  our 
system.  From  p  and  a1  above,  we  define  b  and  a\  (6  signifies 
Brownian),  the  mean  and  variance  of  the  amount  of  work 
accumulated  per  unit  time: 

b  =  -^-M  =  PaM  (18) 

('a  I  £71 

2x2  ,  n2f2 

_2  _  uavn  '  unva  jyr  /i  n\ 

-  —« + M-  (19) 

We  use  b  and  a\  later  in  this  analysis.  Note  that  b  agrees  with 
the  average  accumulated  work  that  we  found  in  Section  III. 

We  must  still  assure  ourselves  that  our  stochastic  process 
indeed  has  independent  increments.  On  a  short  term  scale, 
this  is  clearly  not  true.  Two  consecutive  one-minute  intervals 
are  likely  to  have  the  same,  or  at  least  similar,  numbers 
of  available  processors  in  both  intervals,  and  hence  similar 
amounts  of  work  accumulated  in  those  intervals.  However, 
in  two  one-minute  intervals  separated  by  several  hours,  the 
number  of  available  processors  is  quite  unrelated  (unless  the 
network  has  some  very  unusual  statistical  properties),  and  the 
work  accumulated  in  one  interval  is  quite  independent  of  the 
work  accumulated  in  the  other  interval.  Thus  we  conclude  that 
the  Brownian  motion  model  is  reasonable  only  over  a  long 
span  of  time,  and  we  insure  this  by  specifying  that  ta  W 
and  tn  <C  W.  Note,  too,  that  we  are  using  the  asymptotic 
results  of  Section  IV-C,  and  these  are  valid  only  for  a  long 
span  of  time,  which  also  requires  a  large  W  relative  to  ta 
and  tn. 

The  Brownian  motion  model  does  allow  some  behavior  that 
seemingly  cannot  occur  in  a  real  network.  For  example,  the 
process  is  allowed  to  move  in  the  negative  direction,  implying 
that  we  can  lose  work  that  we  have  already  done.  This  is  an 
artifact  of  the  model,  and  it  is  particularly  apparent  at  small  t, 
but  it  is  negligible  for  the  conditions  under  which  the  Brownian 
motion  model  is  useful.  Given  6,  a\ ,  and  t,  and  using  the  fact 
that  cumulative  work  is  normally  distributed,  we  can  compute 
the  probability  of  negative  cumulative  work  as 

/  -bt\  i 


ta,  and  tn.  Thus  as  time  passes  and  t  moves  away  from 
0  and  becomes  large,  -v/(Mfn/2(l  -  pa))t  becomes  quite 
negative  and  $((-bt/ ^crDt)  shrinks  to  near  0.  In  fact,  for 
t  =  Zyfcrllb,  the  3 a  point,  the  probability  that  we  have 
negative  work  is  approximately  0.0023  and  drops  rapidly 
thereafter  to  negligible  amounts  as  t  grows.  This  is  just  further 
confirmation  that  our  model  is  valid  only  for  relatively  large 
W  that  requires  more  than  a  short  time  to  complete. 

Using  b  and  <j\  as  the  parameters  for  our  Brownian  motion, 
and  using  results  in  Karlin  and  Taylor  [33],  we  find  the 
probability  density  of  the  time,  t ,  that  it  takes  for  M  processors 
to  finish  W  minutes  of  work  is 

*0-  l  (2D 


This  has  mean 


and  variance 


W  (ta  +  tn) 
M  ta 


Note  that  for  the  case  M  —  1,  the  mean  and  variance  agree 
with  the  direct  analysis  of  Section  IV-B.  Of  course  the  mean 
is  consistent  with  that  of  Section  III  for  all  M. 

B.  Example:  Exponential  Distributions 

If  we  assume  that  both  available  and  nonavailable  periods 
are  exponentially  distributed,  then  the  mean  and  variance  of 
the  accumulated  work  per  unit  time  are: 

b  =  T~7~7~M  =  paM  (24) 

t'a  “r 

2  __  2(£a£n)2  __  2p2(  1  —  pa)M 

6~(*„  +  *n)3  ■  (25) 

Applying  this  to  (22)  and  (23)  yields  the  mean  of  the  finishing 


7_  W  _  W(ta+tn) 
1  b  M  ta 


and  variance 


f~  if 


2  W  t2n 
M2  ta 


where  $(ar)  is  the  cumulative  density  function  for  a  standard 
normal  distribution.  For  t  near  0,  $(—bt/^/oft)  «  0.5, 
meaning  that  for  very  small  t,  our  model  says  that  almost 
50%  of  the  time  the  program  has  accumulated  a  negative 
amount  of  work.  Clearly,  Brownian  motion  is  a  poor  model 
of  networks  of  transient  processors  for  very  small  t.  If  we 
manipulate  the  expression  —bt/^fofi  we  find  it  is  equal  to 
—  2(1  —  pa))t,  and  the  coefficient  of  t  under  the 

radical  is  greater  than  1  for  any  reasonable  values  of  M, 


Fig.  5  shows  the  finishing  time  density  for  both  the  direct 
and  the  Brownian  motion  analyses  with  ta  =  3600,  tn  ~ 
300,  M  =  1,  and  W  =  105.  We  note  the  good  concordance 
between  the  two  analyses. 

When  we  have  M  —  100  processors,  Fig.  6  shows  the  pdf  of 
finishing  time  for  various  tn  with  ta  —  91  min  and  W  =  104 
min.  Using  ta  —  91  min  and  tn  —  31.305  min  (the  standard 
multiprocessor  example),  we  have  b  =  7AAt  and  —  887.2 1, 
which  leads  to  /  —  0.0134W  and  oj  —  0.00215VF.  Our 
example  job  of  104  min  would  take  about  a  week  to  run  on 
a  single,  dedicated  processor.  When  run  on  a  network  of  100 
transient  processors,  it  would  take  134.06  min,  or  about  2.25  h. 
This  particular  finishing  time  pdf  is  in  Fig.  7,  which  shows  the 
density  both  enlarged  and  plotted  on  a  full  time  axis  (starting 
at  0). 
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Fig.  5.  Finishing  time  densities  for  direct  and  Brownian  motion  analyses. 


Fig.  6.  Finishing  time  densities  for  Brownian  motion  model  for  various  tn . 


Fig.  7.  Finishing  time  density  for  Brownian  motion  analysis. 


C.  Example:  Mutka  and  Livny’s  Distributions 

Let  us  use  the  distributions  measured  by  Mutka  and  Livny. 
We  have  ta  =  91  min,  —  40225  min2,  tn  =  31.305  min, 
and  <r2  =  2131.83  min2.  Plugging  these  into  (18)  and  (19), 
we  find 

b  =  74At  (28) 

(j\  =  6239L  (29) 


This  leads  to  a  finishing  time  mean  and  variance  of 

/  =  0.0134  VP  (30) 

o)  =  0.0151VP.  (31) 

We  note  that  the  finishing  time  variance  using  Mutka  and 
Livny’s  distributions  is  almost  an  order  of  magnitude  more 
than  for  exponential  distributions. 


D.  The  Ratio  &f/f 

It  is  instructive  to  examine  the  coefficient  of  variation  of 
the  finishing  time,  namely  the  ratio  of  Of  to  /: 


°_±  _  VR 

f  VW 


(32) 


We  note  immediately  that  this  ratio  goes  to  zero  as  W 
increases.  Consequently,  for  sufficiently  large  W,  it  may  be 
accurate  enough  to  consider  the  finishing  time  distribution  as 
an  impulse  (i.e.,  the  Dirac  delta  function)  at  the  mean  finishing 
time  (in  the  spirit  of  the  law  of  large  numbers). 

Assume  that  the  available  and  nonavailable  periods  have 
exponential  distributions.  Then  the  ratio  becomes 


ZL 

f 


W  1  +  tn/ta 


(33) 


Because  we  assumed  tn  <  W,  this  ratio  tends  to  be  less  than 
1.  If  we  fix  tn/  W  and  let  tn/ta  go  to  infinity  (which  implies 
ta  — >  0),  the  ratio  goes  to  0.  We  explain  this  by  noting  that 
for  small  ta,  it  takes  very  many  available-nonavailable  cycles 
before  the  work  is  finished.  The  law  of  large  numbers  insures 
that  the  finishing  time  density,  which  is  the  sum  of  these  many 
periods,  will  then  be  tight  about  its  mean. 

If,  on  the  other  hand,  we  let  ta  oo,  the  ratio  of  the 
standard  deviation  to  the  mean  goes  to  zero  once  again.  When 
ta  is  large  relative  tcfTn,  the  nonavailable  periods  become 
negligible,  as  if  the  processors  are  always  available.  Again, 
the  finishing  time  density  becomes  very  tight  about  its  mean 
because  nonavailable  time  periods  add  little  variability  to  the 
finishing  time,  and  under  some  circumstances  we  may  consider 
the  finishing  time  density  as  an  impulse  located  at  t  =  W/M. 
Using  the  standard  multiprocessor  example  again  (M  =  100) 
with  exponential  distributions,  we  find  oj  =  21.53,  and 
approximating  f(t)  as  a  normal  density  (discussed  below),  we 
find  that  90%  of  the  time,  programs  requiring  104  min  of  work 
will  finish  within  7.6  min  of  the  134.4  min  mean  finishing  time, 
which  is  an  interval  3%  on  either  side  of  the  mean.  This  is 
very  narrow  indeed.  If  we  use  Mutka  and  Livny’s  distributions, 
then  90%  of  the  time  programs  finish  in  an  interval  10  min  on 
either  side  of  the  mean,  which  is  still  quite  narrow. 

We  find  the  peak  of  (33)  by  taking  the  derivative  with 
respect  to  ta : 


d  Z_f_  tn(tg  H~  tn)  2 tgtn  (/XA\ 

9ta  f  ~  VWU(ta+tn)Z  ‘  (  ’ 

Setting  this  equal  to  0  yields  ta  =  <„  as  the  peak  of  the  ratio, 
at  which  point  it  takes  on  the  value  Of/f  =  y/tn/2W.  The 
ratio’s  value  at  the  peak  is  small  because  of  our  assumption 
that  tn  <C  W.  Note  that  if  we  assume  ta  =  tn,  but  not 
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Fig.  10.  Finishing  time  density  narrowing  as  ta  grows  (top  figure)  and  as 
ta  shrinks  (bottom  figure). 

tn  =  31.305  min,  M  —  100,  and  W  =  104  min.  In  the  top 
plot,  the  density  narrows  as  ta  =  31.305,  150,  300,  and  800 
min.  In  the  bottom  plot,  we  have  ta  =  31.305,  10,  3,  and  1 
min  as  the  density  narrows. 


Fig.  9.  Two  views  of  the  Brownian  motion  finishing  time  densities  with 
varying  ta. 

tn  <C  W,  then  we  can  make  the  ratio  as  large  as  we  want, 
simply  by  increasing  tn.  If,  for  example,  ta  =  tn  =  1  year, 
then  either  the  system  is  available  immediately  to  do  all  our 
work,  or  else  we  will  have  to  wait  a  very  long  time  before 
it  even  starts.  In  such  a  case,  the  finishing  time  still  has  a 
reasonable  mean  but  an  enormous  variance.  Another  fact  to 
note  is  that  M,  the  number  of  processors,  does  not  affect  the 
ratio  <j f  /  f .  Even  if  we  have  an  infinite  number  of  processors, 
we  can  still  have  great  variance  relative  to  the  mean.  Of  course, 
both  the  mean  and  the  standard  deviation  go  to  zero  as  M 
grows,  but  their  ratio  remains  constant. 

In  Fig.  8  we  plot  oj/f  versus  ta  for  W  =  104,  with  tn/W 
fixed  for  each  curve.  Fig.  9  shows  finishing  time  densities 
for  tn  =  31.305  min  and  various  ta.  The  z-axis  (labeled 
“Time,  Relative  to  Mean”)  is  centered  about  the  mean  and 
plots  the  distance  relative  to  the  mean  (varying  from  0.9  times 
the  mean  to  1.1  times  the  mean).  We  note  that  the  density 
is  flattest  and  has  the  greatest  spread  for  ta  =  tn;  at  this 
point  Of] f  =  0.035,  which  is  quite  small.  For  comparison, 
if  we  use  Mutka  and  Livny’s  distributions  at  the  same  point, 
the  ratio  is  0.092,  which  is  still  small.  The  narrowing  of  the 
density  is  also  illustrated  in  Fig.  10.  The  parameters  for  both 
plots  are:  exponential  distributions  with  varying  ta  but  fixed 


E,  Normal  Approximation  to  the  Finishing  Time 

The  usual  form  of  the  central  limit  theorem  states  that  the 
sum  of  n  independent  random  variables  tends  to  have  a  normal 
distribution  as  n  gets  large.  Given  this,  we  would  expect  the 
limiting  distribution  of  f(t)  to  be  normal  with  mean  /  and 
variance  crj.  Let  us  denote  this  normal  approximation  by  /(£): 


e-(t-/)2/(  2**)' 


(35) 


Substituting  t  =  /  shows  that  the  finishing  time  and  its  normal 
approximation  coincide  at  the  mean: 


=  /(/)• 


Observation  shows  that  f(t)  and  f(t)  also  coincide  at  two 
more  points,  but  analytically  these  are  not  easily  found  because 
they  are  the  solutions  to  a  transcendental  equation.  Numer- 
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^Brownian  Motion  Model 


ta  =  91  min. 
tn  =  31 .305  min. 
M  =  100 


?  /  vx,  Normal  Approximation 

h  t  - 1 —  7**^  "  .  iTime  to  Finish 

0.  0.05  0.1  0.15  0.2  0.25  0.3  0.35  (mjnutes) 
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Approximate 


f»  =  91  min. 
f«  =  31 .305  min. 
W  =  100 
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Fig.  12.  Density  of  finishing  time  and  its  lognormal  approximation. 


W=  10000  min. 


Time  to  Finish  I 


120.  125.  130.  135.  140.  145.  150. 


Fig.  11.  Brownian  motion  finishing  time  pdf  and  its  normal  approximation. 


ically,  we  find  that  these  points  appear  to  be  separated  by 
Jl2o2,  and  the  distance  from  the  lower  point  (smaller  t)  to 
the  mean  is  very  slightly  less  than  half  of  the  total  distance 
between  the  two  points  (varying,  but  in  the  range  of  49.5% 
of  the  total  separation). 

We  need  to  know  when  f(t)  is  a  good  approximation  for 
f(t).  Observation  (see  Fig.  11)  shows  that  the  approximation 
is  good  when  the  mode  of  the  finishing  time  is  close  to  (within 
a  few  percent  of)  its  mean.  We  find  the  mode  by  taking  the 
derivative  of  f(t)  with  respect  to  t ,  setting  it  equal  to  0,  and 
solving  for  t.  We  end  up  with  a  quadratic  equation  that  has 
a  negative  and  a  positive  root.  The  positive  root  is  the  mode, 
namely 


^mode  —  2  } 


htf)2 


Probability 


Accumulated 


Work 


Amount  of 
WorkCompleted 
by  t 


By  using  the  fact  that  \Ja  +  b  <  y/a  +  \/6,  we  show  that  the 
mode  is  always  less  than  or  equal  to  the  mean: 

*3 _f_7. 

262  b  2 b2  b 

Furthermore,  if  we  observe  that  9 (crf)2/b4  is  usually  much  less 
than  4 W2  fb  ,  and  we  use  the  approximation  \f  \  +  e  «  1  +  | 
for  0  <  e  <  1,  then 

t  ,  ^W_3al(  3 £f\ 
mode  b  4  Wb) 

*7 -If-  (37) 

Under  almost  all  circumstances,  we  may  drop  the  negative 
term  in  the  parenthesis,  because  when  the  Brownian  motion 


Fig.  13.  Density  of  finishing  time. 


approximation  is  valid  we  also  know  that  W  >  cq,  /6,  and  in 
general,  W  >  df>.  These  would  render  the  term  Sa2/4Wb 
negligible.  Only  under  very  unusual  circumstances  would 
W  a2,  and  in  such  cases  we  could  not  drop  the  term. 
Excepting  such  circumstances,  (37)  is  quite  accurate  for  all 
conditions  in  which  our  Brownian  motion  model  is  operative. 
Using  (37),  we  find  that  the  percent  difference  between  the 
mean  and  mode  is  approximately  3<jb/2bW. 


F.  Lognormal  Approximation  to  Finishing  Time 

A  lognormal  density  provides  a  remarkably  good  approxi¬ 
mation  to  (21).  A  lognormal  distribution  has  two  parameters, 
pi  and  of.  If  we  equate  the  mean  and  variance  of  the  finishing 
time  pdf  from  the  Brownian  motion  model  to  that  of  a 


.  ItLEINRQCK  AND  KORFHAGE:  COLLECTING  UNUSED  PROCESSING  CAPACITY 


545 


Fig.  14.  Simulation  results. 


lognormal  pdf,  then  we  find  that  these  parameters  must  be 

«  =  ln(7)  -  \°i 


a\  =  In  (o///2  +  1). 


(38) 

(39) 


The  lognormal  pdf  fit  to  (21)  then  becomes 


*(*)  = 


1 


exp 


(In (t)  - 


(40) 


As  shown  in  Fig.  12,  when  both  the  Brownian  motion  finishing 
time  pdf  and  the  lognormal  approximation  are  plotted,  the 
densities  are  extremely  close,  and  the  plotted  curves  appear 
to  lie  on  top  of  each  other.  When  the  two  do  differ,  it  is  under 
circumstances  where  the  assumptions  of  the  Brownian  motion 
model  do  not  hold  (e.g.  W  small  relative  to  all  of  M,  ta,  and 

tn\ 


G.  The  Finishing  Time  Density  and  its 
Derivation  from  a  Normal  pdf 

We  can  rewrite  the  Brownian  motion  finishing  time  density, 
(21),  in  a  form  using  a  normal  pdf.  Let  be  the 

probability  density  that  a  random  variable,  normally  distributed 
with  mean  p  and  variance  a2,  takes  on  the  value  x.  Using 
this,  (21)  becomes 

m  =  (41) 

The  normal  pdf  term  derives  from  the  underlying  Brownian 
motion;  it  is  the  probability  that  a  total  of  W  units  of  work 
have  been  accumulated  by  time  t.  As  for  the  weighting  factor 
of  W/t,  no  intuitive  explanation  has  yet  been  found  for  this. 
Fig.  13  illustrates  the  relationship  between  (21)  and  (41).  In 
this  figure,  the  normal  pdf  of  the  amount  of  work  done  by 
time  x ,  4%to. 2t(x)>  is  plotted  with  thin  lines  for  various  t. 
The  shaded  plane  in  the  figure  picks  out  those  points  on  the 
normal  pdf  where  x  —  W;  the  line  arcing  down  (top  left  to 
lower  right  in  the  bottom  view)  within  this  plane  represents 


W/t.  The  other  line  within  the  shaded  plane  is  the  finishing 
time  density,  i.e.,  the  product  of  these  last  two  curves.  The 
curves  have  been  scaled  differently  to  make  them  fit  into  one 
plot,  so  relative  heights,  except  within  the  group  of  normal 
curves,  are  meaningless. 


//.  Simulation  Results 

We  ran  simulations  for  the  case  of  exponentially  distributed 
available  and  nonavailable  periods.  Some  results  comparing 
the  simulation  to  the  Brownian  motion  model  are  shown  in 
Fig.  14.  The  Brownian  motion  model  and  the  simulation  agree 
very  well  for  large  W,  as  we  would  expect,  and  they  deviate 
as  W  becomes  small. 


VI.  Conclusion 

In  this  paper,  we  analyzed  the  distribution  of  the  time  to 
finish  a  distributed  program  running  in  a  network  of  transient 
processors.  We  first  made  two  analyses  of  a  program  running 
on  a  single  transient  processor.  These  results  were  then  used 
as  the  basis  for  a  Brownian  motion,  multiprocessor  model, 
and  from  this  we  found  four  finishing  time  distributions:  the 
actual  Brownian  motion  finishing  time  distribution,  and  its 
normal,  lognormal,  and  impulse  approximations.  The  models 
in  this  paper  offer  an  approach  to  predicting  performance  of 
distributed  programs  on  transient  processors.  By  relaxing  some 
of  our  assumptions,  as  discussed  below,  more  sophisticated 
models  could  be  derived  from  those  that  have  been  described. 

The  first  assumption  that  we  would  like  to  relax  concerns 
the  asymptotic  nature  of  the  results.  The  results  we  have 
given  are  valid  only  over  a  long  period  of  time.  If  we  have 
a  relatively  small  amount  of  work  to  do  (say,  several  hours), 
then  our  finishing  time  distributions  are  not  valid.  Their  means 
are  acceptable,  but  the  variance  is  quite  incorrect,  and  the 
distributions  (except  for  the  impulse  approximation)  show 
noticeable  probability  that  the  program  will  finish  in  less 
than  the  minimal  time  required  (W/M).  Judging  from  the 
simulation  results,  there  may  well  be  some  simple  way  to 
heuristically  modify  the  variance  expression  in  our  models  so 
they  provide  acceptable  results  for  small  W. 
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A  second  assumption  we  would  like  to  relax  concerns  our 
model  of  a  program.  Modeling  an  algorithm  as  sequential, 
independent  stages  is  very  simplistic.  Many  programs  do  not 
have  clear  stages,  but  instead  have  a  more  complex  internal 
precedence  structure  among  the  tasks  of  the  program  which 
cause  additional  delays.  The  independent-stages  model  may 
provide  a  useful  simplification,  but  testing  this,  and  developing 
more  complicated  program  models,  remains  for  future  work. 

A  third  assumption  of  great  importance  is  that  our  network 
model  does  not  account  for  the  realities  of  communications. 
Communication  entails  delay,  and  our  model  does  not  address 
this  issue.  Solutions  to  this  are  currently  under  investigation, 
and  some  possibilities  are  mentioned  in  [1]. 

The  models  in  this  paper  do  have  many  assumptions,  yet 
their  very  simplicity  makes  them  appealing.  An  alternative 
to  our  models  is  performability  analysis,  yet  the  complexities 
of  performability  models  yield  only  numeric  results  or  a 
Laplace  transform,  and  not  a  direct  analytic  expression  for 
the  distribution  of  program  completion  time.  Furthermore,  the 
basic  parameters  of  our  models  can  be  modified  to  remove 
some  of  the  assumptions,  and,  at  least  for  large  W,  the  models 
already  do  capture  the  basic  behavior  of  transient,  distributed 
systems. 
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Abstract 

We  present  a  parallel  implementation  of  Iterative-Deepening-A*,  ' 
a  depth-first  heuristic  search,  on  the  single-instruction,  multiple-data 
(SIMD)  Connection  Machine.  Heuristic  search  of  an  irregular  tree 
represents  a  new  application  of  SIMD  machines.  The  main  technical 
challenge  is  load  balancing,  and  we  explore  three  different  techniques 
in  combination.  We  also  present  a  simple  method  for  dynamically 
determining  when  to  stop  searching  and  start  load  balancing.  We 
achieve  an  efficiency  of  67%,  for  a  speedup  of  5500,  with  8K  processors, 
and  an  efficiency  of  57%,  for  a  speedup  of  9300,  with  16K  processors. 

"This  work  is  supported  in  part  by  W.  M.  Keck  Foundation  grant  number  W880615, 
NSF  Biological  Facilities  grant  number  BBS  87  14206,  the  Defense  Advanced  Research 
Projects  Agency  under  Contract  MDA  903-87-C0663,  an  NSF  Presidential  Young  In¬ 
vestigator  Award  to  the  third  author,  Rockwell  International,  the  Advanced  Computing 
Facility,  Mathematics  and  Computer  Science  Division,  Argonne  National  Laboratory,  and 
by  Thinking  Machines  Corporation. 
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searches,  or  iterations.  In  each  iteration,  a  branch  is  cut  off  when  the  f(n) 
value  of  the  last  node  on  the  path  exceeds  a  cost  threshold  for  that  iteration. 
The  threshold  for  the  first  iteration  is  set  to  the  heuristic  value  of  the  initial 
state  and  each  succeeding  threshold  is  set  to  the  minimum  /  value  that 
exceeded  the  previous  threshold.  Successive  iterations  continue  until  a  goal 
node  is  chosen  for  expansion.  Since  at  any  point  it  is  performing  a  depth-first 
search,  IDA^’s  memory  requirement  is  only  linear  in  the  solution  depth.  As 
a  result,  IDA*  can  find  optimal  solutions  to  the  Fifteen  Puzzle,  but  some 
problem  instances  require  tens  of  hours  on  current  uniprocessors,  motivating 
the  use  of  parallel  processing. 

1.2  Parallel  Heuristic  Search 

Parallel  processing  can  dramatically  reduce  the  time  of  a  search  algorithm. 
There  are  essentially  three  different  approaches  to  parallelizing  search  algo¬ 
rithms.  The  first  is  to  parallelize  the  tasks  associated  with  the  processing  of 
individual  nodes,  such  as  move  generation  and  heuristic  evaluation.  This  is 
the  approach  taken  by  HITECH  [5]  and  Deep-Thought [15],  both  of  which  use 
special-purpose  hardware  in  an  eight  by  eight  array  to  compute  chess  moves 
in  parallel.  The  speedup  achievable  in  this  scheme  is  limited,  however,  by 
the  degree  of  parallelism  available  in  move  generation  and  evaluation.  In 
addition,  this  approach  is  inherently  domain-specific,  and  unlikely  to  lead  to 
general  techniques  for  using  parallel  processors  to  accelerate  search. 

A  second  approach  is  called  parallel  window  search.  This  method  gives 
each  processor  the  entire  tree  to  search,  but  divides  the  range  of  the  cost 
bounds  (windows)  among  the  processors.  Parallel  window  search  was  origi¬ 
nated  by  Gerard  Baudet  [l]  for  use  in  searching  two-player  games  trees  [24], 
but  has  also  been  applied  to  IDA*  [27]  [28]  [29],  where  different  processors  are 
assigned  different  thresholds  to  search.  When  a  processor  completes  search¬ 
ing  to  its  assigned  threshold,  it  then  searches  the  next  unassigned  threshold. 
Unfortunately,  this  approach  is  limited  by  the  time  to  perform  the  goal  iter¬ 
ation.  To  overcome  this  limitation,  parallel  window  search  can  be  combined 
with  node  ordering  to  reduce  the  time  taken  by  the  last  iteration. 

The  third,  and  perhaps  most  obvious  approach,  is  tree  decomposition. 
While  parallel  window  search  assigns  each  processor  the  entire  tree  to  search, 
tree  decomposition  divides  the  tree  into  subtrees,  assigning  different  subtrees 
to  different  processors.  In  principle,  tree  decomposition  allows  the  effective 
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2  Overview 


We  first  present  our  basic  algorithm,  SIDA*,  which  stands  for  SIMD  IDA*, 
lie  main  idea  is  to  divide  the  tree,  and  assign  different  processors  to  search 
1  erent  subtrees.  By  itself,  this  results  in  very  poor  speedup,  since  in  an 
irregular  tree,  some  processors  will  finish  searching  their  subtrees  long  before 
others,  and  must  remain  idle  while  waiting  for  the  others  to  complete.  Thus, 
the  mam  challenge  is  to  perform  effective  load  balancing  in  order  to  maintain 
nearly  equal  amounts  of  work  on  each  processor  and  minimize  idle  processor 
overhead.  Load  balancing  can  be  used  in  the  initial  distribution  of  nodes 
to  processors,  between  successive  iterations  of  an  iterative-deepening  search, 
and  also  within  a  particular  iteration.  Choosing  the  right  times  to  load  bal¬ 
ance  is  critical  to  performance,  and  we  present  two  different  load-balance 
triggering  strategies.  We  show  that  overall  efficiency  can  be  decomposed 
into  the  product  of  four  different  intuitive  performance  parameters.  We  then 
discuss  the  results  of  implementing  SIDA*  on  a  Connection  Machine  for  the 
Fifteen  Puzzle  domain,  and  demonstrate  a  speedup  of  5500  on  8K  proces¬ 
sors,  and  9300  on  16K  processors.  Next,  we  present  an  analytic  model,  based 
on  an  exponential  distribution  of  work,  and  analyze  performance  as  a  func¬ 
tion  of  the  load-balancing  mechanism.  We  also  compare  our  work  to  several 
other  SIMD  tree-search  algorithms.  Finally,  we  discuss  further  work  and  our 
conclusions. 

IDA*  is  merely  one  example  of  a  depth-first  search  of  an  irregular  tree. 
Other  examples  include  two-player  alpha-beta  minimax  search,  backtracking 
for  constraint-satisfaction  problems,  and  depth-first  branch- and-bound  for 
combinatorial  optimization  problems.  The  techniques  presented  here  are 
largely  independent  of  the  domain  and  particular  search  algorithm,  and  hence 
should  be  applicable  to  other  depth-first  searches  as  well. 

3  Basic  algorithm 

The  basic  SIDA*  algorithm  consists  of  an  initial  distribution  of  frontier  nodes 
to  processors,  followed  by  a  series  of  iterations  of  IDA*.  In  each  iteration, 
each  processor  independently  performs  a  depth-first  search  of  the  subtree 
below  its  frontier  node.  An  iteration  is  not  complete  until  all  processors 
have  completed  their  searches.  During  an  iteration,  when  enough  processors 
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3.2  SIMD  Depth-First  Search 

Once  the  initial  distribution  is  completed,  each  processor  conducts  a  depth- 
first  search  of  its  assigned  frontier  node.  Although  processors  independently 
perform  their  searches  on  their  own  subtrees,  they  all  use  the  same  global 
cost  threshold.  Collectively,  the  group  of  processors  performs  an  iteration  of 
IDA*  on  the  entire  tree,  with  each  processor  performing  its  iteration  on  its 
own  subtree.  When  all  subtrees  have  been  explored,  the  global  threshold  is 
increased  for  the  next  iteration  of  IDA*. 

Processors  use  a  stack  to  represent  the  current  search  path  in  the  tree. 
If  we  think  of  the  root  of  the  tree  being  at  the  ’top’,  then  generating  a 
state  corresponds  to  moving  down  the  stack,  and  returning  to  a  previously 
generated  state  in  a  shallower  part  of  the  tree  corresponds  to  ‘backing  up’ 
the  stack. 

The  stack  can  either  contain  the  sequence  of  moves  associated  with  the 
path,  or  can  contain  the  sequence  of  states  created  by  making  moves  on  the 
path.  The  first  approach  requires  only  storing  the  current  state  in  the  search. 
Generating  a  new  node  involves  just  moving  one  tilg  in  this  global  state.  In 
backing  up,  it  is  necessary  to  ‘undo’  the  previous  move  by  moving  the  tile 
back  to  its  previous  location.  This  approach  is  analogous  to  what  a  person 
would  do  in  solving  a  Fifteen  Puzzle  by  hand. 

The  second  approach  is  to  make  a  separate  copy  of  each  newly  generated 
state,  with  the  appropriate  modification,  and  place  this  on  the  stack.  This 
approach  has  the  advantage  of  not  requiring  a  move  to  be  undone  when 
backing  up  the  stack.  It  has  the  disadvantage  of  the  overhead  required  to 
copy  each  state.  For  most  problems,  it  is  faster  to  record  an  incremental 
change  to  the  state,  than  to  copy  the  entire  state,  and  hence  we  adopt  the 
former  approach. 

Unfortunately,  the  trees  generated  by  almost  all  heuristic  searches  have 
irregular  branching  factors  and  depths.  This  means  that  during  the  search 
phase,  each  processor  will  have  a  different  size  subtree  to  search,  and  once 
a  processor  completes  its  search,  it  must  wait  for  all  other  processors  to 
complete,  before  beginning  the  next  iteration.  This  results  in  tremendous 
idle-processor  overhead,  since  the  variation  in  the  sizes  of  the  individual  sub¬ 
trees  is  enormous  in  practice.  Thus,  load  balancing  is  necessary  to  effectively 
use  a  large  number  of  processors,  and  is  a  central  focus  of  this  research. 
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set  of  nodes  of  minimum  /  value  are  expanded  in  parallel.  The  threshold 
for  the  first  iteration  of  the  search  phase  is  set  to  the  minimum  /  value  of 
all  the  frontier  nodes  at  the  end  of  the  distribution  phase.  Those  processors 
assigned  nodes  whose  /  value  is  greater  than  the  minimum  do  not  initially 
participate  in  the  first  search  phase,  and  only  become  active  after  the  first 
load  balance. 

Since  the  distribution  phase  is  done  only  once  during  the  start  of  the 
search,  regardless  of  how  accurately  the  load  is  balanced  during  this  phase, 
the  loads  will  become  unbalanced  as  the  search  progresses  through  multiple 
iterations. 

4.2  Load  Balancing  Between  Iterations 

After  each  iteration  of  the  search  phase  is  completed,  there  is  another  oppor¬ 
tunity  for  load  balancing,  based  on  information  collected  during  the  previous 
iteration.  In  particular,  the  actual  number  of  nodes  expanded  below  each 
frontier  node  during  the  last  iteration,  the  load  of  that  node,  is  a  lower  bound 
on  the  number  that  must  be  expanded  on  the  next  iteration.  Furthermore,  in 
the  absence  of  any  further  information,  we  expect  the  relative  loads  of  differ¬ 
ent  frontier  nodes  to  remain  reasonably  stable  from  one  iteration  to  the  next. 
In  other  words,  a  node  with  a  relatively  heavy  load  during  one  iteration  is 
likely  to  have  a  relatively  heavy  load  during  the  next  iteration  as  well.  Given 
this  assumption,  we  would  like  to  reallocate  processors  to  frontier  nodes  so 
that  lightly  loaded  nodes  lose  their  processors  to  more  heavily  loaded  ones. 

The  challenge  is  to  accomplish  this  reallocation  while  maintaining  the 
one-to-one  correspondence  between  frontier  nodes  and  processors.  Consider 
a  frontier  node  that  was  relatively  lightly  loaded  on  the  last  iteration,  and 
whose  brothers  were  also  lightly  loaded.  We  can  contract  this  node  by  dis¬ 
carding  the  children,  and  making  the  parent  a  frontier  node.  One  of  the 
processors  assigned  to  the  children  is  then  assigned  to  the  parent,  and  the 
remaining  processors  are  free  to  be  assigned  elsewhere.  Node  contraction  was 
introduced  by  Chakrabarti  et  al  [2],  to  save  memory  in  a  serial  version  of  A*, 
and  also  used  by  Evett  et  al  [7]  in  a  SIMD  version  of  A*. 

Now  consider  a  node  that  was  relatively  heavily  loaded  on  the  last  it¬ 
eration.  It  is  expanded,  generating  its  children.  The  parent  now  becomes 
an  interior  node,  and  the  processor  assigned  to  the  parent  is  reassigned  to 
one  of  the  children.  The  remaining  children  are  assigned  processors  from  the 
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processor  writes  its  processor  ID  to  a  location  associated  with  its  sequence 
number  in  the  list,  which  is  then  read  by  the  active  processor  with  the  same 
numbering  in  its  list.  As  a  result,  each  active  processor  obtains  the  ID  of 
a  unique  fiee  processor  to  send  work  to.  The  processors  are  arranged  in 
a  hypercube  communication  network,  and  the  time  to  transfer  work  is  a 
function  of  how  many  hops  through  the  network  the  work  must  travel.  The 
rendezvous  allocation  method  is  global  in  that  there  is  no  consideration  of 
matching  processors  that  are  logically  close  in  the  network.  An  alternative 
approach  is  to  consider  locality  in  the  hope  of  reducing  communication  costs 

[3]. 

W  hen  the  number  of  processors  with  work  to  share  exceeds  the  number  of 
free  processors,  priority  is  given  to  processors  transferring  nodes  with  lower 
/  values.  The  reason  is  that  nodes  with  lower  /  values  are  expected  to  have 
larger  subtrees  associated  with  them. 

Work  that  is  shared  within  iterations  represents  only  a  temporary  reas¬ 
signment  of  nodes  to  processors,  and  does  not  change  the  allocation  of  frontier 
nodes  to  processors  that  was  in  effect  at  the  beginning  of  the  iteration. 

After  the  load  balancing  is  complete,  the  searth  phase  resumes,  until 
enough  processors  become  idle  again  to  justify  another  phase  of  load  balanc¬ 
ing.  Thus,  each  iteration  consists  of  an  alternating  sequence  of  search  and 
load  balancing  phases. 


5  When  To  Load  Balance 

An  important  problem  is  determining  when  to  trigger  load  balancing.  If 
load  balancing  is  performed  too  frequently,  then  a  large  load  balancing  cost 
is  incurred.  If  load  balancing  is  not  performed  frequently  enough,  then  low 
utilization  will  result.  We  discusses  two  solutions  to  this  problem. 


5.1  Constant  Triggering 

The  simplest  approach  is  to  set  a  constant  load-balance  trigger  X ,  between 
0  and  1.  After  the  initial  work  distribution,  all  P  processors  run  until  the 
number  of  processors  remaining  active  drops  to  X  *  P,  meaning  (1  —  X)  * 
P  processors  have  become  idle.  When  this  trigger  X  is  reached,  a  load 
balancing  phase  takes  place  in  which  some  of  the  work  on  the  busy  processors 
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respect  to  ts  and  setting  it  to  0. 


i  *  no  =  w(t) 

clt  t  +  L  -{t  +  L)2  +  t  +  L  ~  ^ 


t  +  L 


w"(0 


(  ),  the  derivative  of  W{t),  is  the  instantaneous  rate  at  which  work  is  being- 
done  by  the  system  at  time  t,  which  is  just  the  number  of  active  processors  at 
time  t.  or  A{t).  Thus,  the  average  rate  of  work  over  the  current  search/load 
balance  cycle  is  maximized  by  load  balancing  when  W(t)/(t  +  L)  =  A(t) 

Since  -l(0  changes  discretely,  load  balancing  should  be  performed  as  soon  as 
+  L)  >  .4(f). 

This  dynamic  triggering  equation  is  quite  general,  and  should  apply  to  any 
problem  where  work  and  load  balancing  must  be  performed  in  distinct  phases. 
Bv  greedily  optimizing  the  rate  of  work  over  each  search/load  balancing 
cycle  we  hope  to  choose  nearly  optimal  times  for  load  balancing.  Since  the 
variables  m  the  equation  are  easily  computed  by  the  program,  this  trigger 
is  straightforward  to  implement.  Most  importantly  however,  this  triggering 
mechanism  automatically  adjusts  itself  to  any  size  problem,  and  to  different 
stages  within  a  single  problem.  * 

Figure  1  shows  actual  data  for  the  way  in  which  the  number  of  active 
processors  varies  over  time,  using  this  dynamic  trigger  for  an  iteration  of  the 
b  lfteen  puzzle  in  which  94  million  node  generations  were  performed.  The  time 
to  do  load  balancing  is  not  shown.  The  area  under  the  curve  represents  the 
total  work  done.  Note  that  as  the  iteration  progresses,  the  load- balancing 
trigger,  which  is  the  bottom  envelope  of  the  curve,  decreases  as  expected. 
When  there  is  less  work  remaining,  the  number  of  active  processors  drops  off 
faster,  and  the  trigger  waits  for  sufficient  work  to  be  done  before  investing 
m  another  load  balance.  Also  note  that  towards  the  end  of  the  iteration,  the 
final  load  balances  do  not  reactivate  all  the  processors.  The  reason  is  that 
there  are  not  enough  nodes  on  the  tops  of  the  stacks  of  the  active  processors 
to  supply  all  the  idle  processors. 


6  Measures  of  Performance 

Before  describing  our  experimental  results,  we  describe  our  speedup  and  effi¬ 
ciency  measures,  and  factor  efficiency  into  four  different  components.  These 
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measures  apply  to  any  SIMD  applications  that  require  load-balancing.  Per¬ 
formance  can  be  measured  over  any  desired  interval,  such  as  the  distribution 
phase,  during  a  single  iteration,  over  an  entire  problem,  or  over  a  group  of 
problems.  Hence,  in  the  following  discussion  the  term  ‘problem’  can  refer  to 
part  of  a  problem,  an  entire  problem,  or  multiple  problems. 

6.1  Speedup  and  Efficiency 

Our  primary  measure  of  performance  is  speedup ,  5,  which  is  the  time  that 
would  be  required  by  the  most  efficient  version  of  serial  IDA*  running  on 
one  processor,  Tida~  divided  by  the  time  required  by  our  parallel  algorithm 
to  solve  a  problem  using  P  processors,  T.  To  save  years  of  computation, 
we  estimated  the  sequential  time  by  running  an  efficient  IDA*  program  on 
a  single  Connection  Machine  processor,  for  an  entire  iteration  of  a  single 
problem.  The  work  done,  W/£» t,  measured  in  node  generations,  divided  by 
the  time,  Tida ,  gives  the  IDA*  work  rate,  W'lDA.  The  ‘prime’  in  W'IDA 
reflects  that  it  is  the  rate  of  work  per  unit  time. 

Efficiency ,  E ,  is  simply  speedup  divided  by  the  number  of  processors, 
P.  Since  we  are  interested  in  knowing  what  factors  contribute  to  overall 
efficiency,  we  divide  efficiency  into  four  components:  raw  speed  ratio,  fraction 
of  time  working,  utilization,  and  work  ratio.  Besides  providing  valuable 
information,  calculation  of  these  factors  provides  a  redundant  check  on  the 
correctness  of  the  efficiency  calculation.  We  discuss  each  of  these  four  factors 
in  turn,  then  show  that  their  product  equals  efficiency. 

6.2  Raw  Speed  Ratio 

The  raw  speed  ratio  reflects  the  fact  that  a  single  active  processor  in  the 
parallel  algorithm,  SIDA*,  performs  work  at  a  slower  rate  than  a  single  CM 
processor  executing  serial  IDA*.  The  main  reason  for  this  is  the  overhead  of 
conditional  statements  on  a  SIMD  machine.  Since  all  processors  must  execute 
the  same  instruction,  to  execute  a  conditional,  all  processors  choosing  the  first 
branch  execute  it,  while  the  remaining  processors  are  idle,  and  then  those 
choosing  the  second  branch  execute  it,  while  the  first  group  is  idle.  Thus, 
the  time  to  execute  a  conditional  statement  on  a  SIMD  machine  is  the  sum 
of  the  times  to  execute  each  branch,  rather  than  a  weighted  average,  as  on  a 
serial  or  MIMD  machine. 
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6.5  Work  Ratio 

One  final  factor  that  affects  performance  is  that  the  total  work  done  by  the 
serial  and  parallel  algorithms  may  be  different.  On  iterations  prior  to  the 
goal  iteration,  both  algorithms  generate  approximately  the  same  number  of 
nodes.  The  numbers  are  not  exactly  equal  because  the  serial  algorithm  must 
start  each  of  it  its  depth-first  searches  from  the  root  node,  whereas  in  the 
parallel  algorithm  each  processor  starts  from  a  frontier  node.  This  difference 
is  insignificant  in  all  but  the  easiest  problems. 

The  final  iteration,  however,  terminates  as  soon  as  a  goal  node  is  chosen 
for  expansion.  Depending  on  the  order  in  which  nodes  are  generated,  and 
the  location  of  the  goal  nodes,  either  the  serial  or  the  parallel  algorithm  may 
explore  more  total  nodes  on  the  final  iteration.  We  will  argue  in  section  7.2 
that  for  this  application,  this  effect  should  not  be  included  in  the  efficiency, 
but  is  important  in  other  applications. 

The  work  ratio,  N.  is  defined  as  the  total  work  done  by  the  serial  al¬ 
gorithm,  WIDA,  divided  by  the  total  work  done  by  the  parallel  algorithm, 
IV .  “  7 


N  =  Work  Ratio  = 

w 

6.6  Effect  of  Initial  Distribution 

The  initial  distribution  of  work  to  processors  is  a  source  of  inefficiency  which 
is  indirectly  included  in  several  of  the  above  factors.  First,  the  raw  speed 
ratio  is  lower  during  distribution  because  node  generations  require  commu¬ 
nication  across  processors  to  transmit  states.  Second,  utilization  is  lower 
during  distribution  because  it  is  impossible  to  obtain  100%  utilization  when 
the  machine  is  being  filled  with  work. 


7  Empirical  Results 

We  implemented  SIDA*  with  the  dynamic  trigger  of  section  5.2  on  a  16K 
processor  Connection  Machine  2  (CM2)  [14],  and  ran  all  100  Fifteen  Puzzle 
problem  instances  from  [18].  The  speedups  tend  to  be  smaller  on  the  easier 
problems,  since  they  can’t  make  full  use  of  16K  processors,  and  the  time 
of  the  distribution  phase  becomes  significant.  In  computing  the  parameters 
derived  in  section  6,  we  treat  all  100  problems  as  if  they  were  a  single  large 
problem.  This  has  the  effect  of  weighting  the  difficult  problems  more  heavily 
than  the  easy  ones,  which  is  appropriate  since  most  of  the  time  is  spent 
solving  the  harder  problems. 

7.1  Overall  Results 

Using  this  approach,  the  speedup  over  all  100  problems  on  16K  processors 
was  about  7800,  which  corresponds  to  an  efficiency  of  almost  48%.  A  total  of 
W[DA  =  36  billion  nodes  were  generated  by  serial  IDA* [18].  Dividing  this  by 
the  node  generation  rate  of  IDA*  on  one  CM  processor,  Wida  =  118  nodes 
per  second,  gives  an  estimated  total  time  for  IDA*  of  Tida  =  84, 746  hours, 
or  9.7  years!  The  total  time  taken  by  SIDA*  to  do  all  100  problems  using 
16K  processors  was  T  =  10.8  hours.  The  speedup  is  therefore  5  =  T[da/T  — 
7800. 

The  average  and  median  times  to  solve  a  single  problem  instance  were 
6.5  and  2.1  minutes,  respectively,  while  the  average  and  median  number  of 
node  generations  were  430,780,150  and  59,512,527,  respectively. 

The  overall  efficiency  is  the  product  of  a  raw  speed  ratio  of  .772,  fraction 
of  time  working  of  .879,  processor  utilization  of  .839,  and  work  ratio  of  .836. 

The  initial  distribution  phase  requires  about  33  seconds  with  16K  proces¬ 
sors,  and  its  effect  on  these  factors  is  insignificant  on  the  hardest  problems, 
since  it  occurs  once  for  each  problem  instance,  and  only  accounts  for  a  tiny 
fraction  of  the  total  node  generations. 

The  .879  fraction  of  time  working  indicates  that  12%  of  the  total  time 
was  spent  load  balancing.  1.7%  of  the  total  time  was  spent  on  load  balancing 
between  iterations,  talcing  about  6.7  seconds  per  iteration.  This  overhead  is 
negligible  since  it  consists  of  a  small  constant  amount  of  work  per  iteration, 
and  the  number  of  iterations  grows  lineaxly  with  problem  difficulty,  while 
node  generations  grow  exponentially.  The  rest  of  the  load  balancing  time  is 
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among  the  virtual  processors.  Therefore,  regardless  of  whether  the  observed 
value  of  the  work  ratio  is  due  to  noise,  or  is  consistently  greater  than  one, 
the  work  ratio  should  be  discounted  in  the  efficiency  measure,  by  setting  it 
to  one.  If  we  set  the  work  ratio  to  one,  then  the  efficiency  becomes  almost 
57%  on  16K  processors,  which  is  the  product  of  the  remaining  three  factors. 
This  is  a  more  accurate  measure  of  the  performance  of  SID  A*,  and  thus  we 
refer  to  the  overall  efficiency  as  57%,  with  a  corresponding  speedup  of  over 
9300.  Note  that  in  other  applications,  such  as  alpha-beta  search,  any  parallel 
algorithm  must  search  more  nodes  than  the  serial  algorithm,  and  hence  in 
these  applications  the  work  ratio  must  be  taken  into  account. 

Our  speedup  is  on  the  order  of  four  orders  of  magnitude  with  16K  proces¬ 
sors.  This  is  relative  to  IDA*  running  on  one  Connection  Machine  processor, 
which  is  about  three  orders  of  magnitude  slower  than  current  workstations, 
yielding  a  performance  improvement  relative  to  a  high-performance  worksta¬ 
tion  of  about  an  order  of  magnitude. 

7.3  Scalability 

Next,  we  consider  how  our  results  scale  as  the  number  of  processors  in¬ 
creases.  Ideally,  efficiency  would  remain  constant  as  the  number  of  proces¬ 
sors  increases.  In  practice,  however,  for  a  given  problem,  as  the  machine  gets 
larger,  efficiency  decreases  somewhat  because  there  is  not  enough  work  to 
utilize  all  the  processors.  Conversely,  for  a  given  size  machine,  as  a  problem 
gets  harder,  utilization  increases,  a  smaller  percentage  of  time  is  spent  load 
balancing,  and  efficiency  increases. 

To  illustrate  this,  we  picked  a  sample  of  problems  of  varying  difficulty. 
We  ordered  the  problems  of  [18]  by  increasing  nodes  generated,  and  picked 
problems  1,  25,  50,  70,  and  90  in  that  ordering.  These  corresponds  to  problem 
numbers  79,  81,  41,  98,  and  63,  respectively,  of  [18].  We  solved  each  problem 
using  from  2K  to  16K  processors  in  2K  increments,  and  plotted  speedup 
versus  number  of  processors  for  each  problem  in  figure  2. 

In  general,  the  harder  the  problem,  the  greater  the  speedup  for  a  given 
number  of  processors.  This  implies  that  larger  machines  will  be  more  cost 
effective  on  harder  problems  than  on  easier  problems,  as  expected.  The 
Connection  Machine,  for  example,  is  currently  available  with  up  to  64K  pro¬ 
cessors. 

We  also  ran  all  100  problems  on  8K  processors.  Counting  them  all  as 
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one  task,  efficiency  increased  from  57%  to  67%  when  going  from  16K  to  8K 
processors,  based  on  a  raw  speed  ratio  of  .846,  a  fraction  of  time  searching  of 
.936,  and  a  utilization  of  .848.  The  work  ratio,  which  is  not  included  in  the 
speedup,  was  .709.  On  the  hardest  problem,  efficiency  was  74%  on  either  8K 
or  16K  processors.  Of  course,  if  a  problem  is  too  hard,  even  high  efficiency 
may  not  be  sufficient  if  the  elapsed  time  is  unacceptably  long. 


8  An  Analytic  Model 

In  order  to  evaluate  these  results  and  compare  various  load  balancing  strate¬ 
gies,  an  analytic  model  for  how  work  is  distributed  and  shared  during  an 
iteration  of  a  typical  parallel  tree  search  algorithm  is  presented.  Initially, 
each  of  P  processors  has  an  amount  of  work  to  do  which  is  independently 
chosen  from  an  exponential  distribution  function  with  a  mean  of  M  seconds: 
e-x/A/  The  exponential  is  chosen  as  the  work  distribution  because  it  is  an¬ 
alytically  tractable,  thus  allowing  us  to  generate  the  optimal  load  balancing 
algorithm  and  compare  its  performance  with  that  £>f  other  load  balancing 
schemes.  Work  is  allowed  to  progress,  with  processors  that  finish  their  work 
becoming  idle,  until  a  load  balancing  phase  is  begun.  The  load  balancing 
phase  takes  L  seconds,  after  which  all  processors  are  reactivated.  When  load 
balancing  occurs  with  less  than  half  the  processors  are  busy,  it  will  take  more 
than  one  communication  to  reactivate  all  the  processors,  since  a  single  pro¬ 
cessor  can  only  send  work  to  one  idle  processor  at  any  given  time.  Therefore, 
L  actually  goes  up  when  fewer  processors  are  active.  This  is  not  taken  into 
account  in  our  model,  but  will  only  effect  the  tail  end  of  most  problems,  since 
load  balancing  is  generally  triggered  with  more  than  half  the  processors  still 
active  (see  Figure  1). 

After  load  balancing,  the  amount  of  work  on  each  of  the  P  processors 
again  comes  from  an  exponential  distribution.  However,  the  mean  of  the  new 
distribution  must  be  chosen  to  reflect  the  fact  that  there  is  now  less  total 
work  remaining.  The  memoryless  property  of  the  exponential  makes  this 
particularly  convenient.  If  the  amount  of  work  on  a  processor  comes  from  an 
exponential  distribution,  then  the  expected  completion  time  of  the  processor 
is  independent  of  how  long  the  processor  has  been  running.  Initially,  the  P 
processors  each  have  an  expected  amount  of  work  of  M  seconds,  resulting 
in  an  expected  total  amount  of  work  of  M  *  P.  Load  balancing  begins  with 
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Figure  3:  Actual  vs.  Exponential  Work  Distribut 


load  balances  performed.  If  B  is  0,  the  run  time  is  approximately  A/* log (P) 

[I'l- 

To  find  the  minimum  run  time  if  exactly  one  load  balance  is  used,  we  must 
compute  the  optimal  trigger  X  at  which  to  perform  the  only  load  balance. 
For  B  =  1,  the  expected  run  time,  not  counting  the  load  balance,  if  we 
trigger  when  X  *  P  processors  remain  busy,  is  the  time  until  the  first  load 
balance  plus  the  completion  time.  The  time  until  the  first  load  balance  will 
be  the  time  for  the  completion  of  the  first  P  *  (1  —  X )  of  P  exponentials  with 
mean  M,  which  is  approximately  M  *log(l/„Y)  [17].  Since  the  work  on  each 
processor  is  reduced  to  M  *  X  after  load  balancing,  the  completion  time  is 
approximately  M  *  X  *  log(P).  This  yields  a  total  search  time,  not  counting 
load  balancing,  of 


M  *  log(l/.Y)  +  M  *  X  *  log (P) 

To  find  the  optimal  trigger  X  in  the  case  where  a  single  load  balance  is 
used,  we  must  minimize  the  value  of  this  equation  for  X.  To  do  this  we  take 
the  derivative  with  respect  to  JY  and  set  it  to  0:  « 

M/X  +  M  *  log(P)  =  0  =>  X  =  l/log(P) 

Plugging  this  back  into  the  equation  for  time,  we  find  that  the  minimum  run 
time  for  1  load  balance  is  : 

M  *  log  log (P)  +  M  *  log (P)/  log(P)  = 

M  *  loglog(P)  +  Af  =  Af  *  (loglog(P)  +  1) 

Now  let  M  *  Zb{P )  be  the  run  time  not  counting  load  balances  for  B 
optimally  performed  load  balances,  running  on  P  processors,  each  with  an 
expected  amount  of  work  of  M  seconds.  We  will  show  by  induction  on 
B  that  Zb{P )  is  independent  of  M.  The  base  cases  are  Zo(P)  =  log(P) 
and  Zj(P)  =  loglog(P)  +  1.  The  amount  of  time  taken  if  i  +  1  optimal 
load  balances  are  performed  is  the  time  to  reach  the  first  optimal  trigger 
JY,  M  *  log(l/X),  plus  the  completion  time  for  i  optimally  performed  load 
balances  on  a  work  distribution  with  a  mean  of  X  *  M  seconds.  Using  the 
assumption  that  Z  is  independent  of  the  mean  of  the  work  distribution,  the 
second  term  is  X  *  M  *  Zi(P).  The  total  time  taken  is  thus 
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efficiency  as  halving  the  number  of  processors,  since  the  amount  of  work  per 
processor  doubles  in  either  case.  For  small  problems,  which  result  in  low  ef¬ 
ficiency,  the  amount  of  work  per  processor  can  be  increased  bv  reducing  the 
number  of  processors.  Low  efficiency  tells  us  that  the  problem  is  too  small  for 
the  number  of  processors  used,  and  reducing  the  number  of  processors  may 

not  greatly  affect  the  time  taken  by  the  algorithm  because  of  the  increased 
emciency. 

The  three  best  approaches  all  perform  very  well,  however,  both  the  opti¬ 
mal  load  balancing  formula  and  the  optimal  constant  triggering  formula  rely 
on  the  fact  that  the  distribution  of  work  associated  with  each  processor  is 
drawn  from  an  exponential  distribution  with  a  known  mean.  The  dynamic 

triggering  formula,  however,  does  not  make  this  assumption,  and  thus  is  more 
general. 

Figure  5  compares  a  line  representing  the  simulated  results  for  the  model 
of  dynamic  load  balancing  (sections  8  and  5.2)  with  data  points  representing 

the  efficiency  of  SIDA*  with  16K  processors  on  all  100  Fifteen  Puzzle  problem 
instances. 

The  main  reason  that  the  results  do  not  match  the  model  very  well  is 
that  the  model  makes  the  optimistic  assumption  that  the  workload  comes 
from  an  exponential  distribution.  Furthermore,  this  distribution  may  change 
over  the  course  of  a  solution  as  load  balancing  is  performed,  thus  decaying 
even  further  from  exponential.  Another  reason  is  that  the  model  computes 
efficiencies  for  a  single  iteration  as  opposed  to  the  multiple  iterations  of  an 
iterative-deepening  search. 


9  Related  Work 

Related  work  on  SIMD  search  includes  our  own  preliminary  work,  a  best-first 
algorithm  based  on  A*,  a  another  depth-first  algorithm  based  on  IDA*,  and 
a  brute-force  depth-first  search. 

Preliminary  results  of  our  current  effort  appeared  in  [25]  and  [26],  which 
describes  the  overall  structure  of  our  algorithm,  together  with  an  implemen¬ 
tation  of  the  distribution  and  search  phases,  including  load  balancing  during 
the  distribution  phase. 
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Figure  5:  Theoretical  and  Empirical  Results  for  Dynamic  Trigger 
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9.2  IDPS 

IDPS  (Iterative-Deepening  Parallel  Search)  [20]  is  a  SIMD  search  algorithm 
based  on  IDA*,  also  using  the  Fifteen  Puzzle  domain,  and  developed  con¬ 
currently  with  SIDA*.  Like  SIDA*,  it  uses  an  initial  heuristic  distribution 
of  nodes  to  processors,  followed  by  alternating  phases  of  search  and  load 
balancing. 

There  are  several  differences  between  IDPS  and  SIDA*.  SIDA*  generates 
nodes  by  making  changes  to  a  single  copy  of  the  state,  and  maintains  the 
structure  of  the  search  tree,  whereas  IDPS  generates  new  nodes  by  copying 
the  entire  state,  and  discards  a  parent  node  after  all  its  children  have  been 
generated.  It  keeps  track  of  the  solution  by  storing  the  moves  made  with 
each  child  node.  In  addition,  IDPS  passes  multiple  independent  nodes  to  a 
single  processor  during  load  balancing,  while  SIDA*  only  passes  single  nodes, 
or  children  of  the  same  parent.  Finally,  IDPS  does  not  perform  between- 
iteration  load  balancing,  and  performs  within-iteration  load  balancing  as 
soon  as  any  processor  becomes  idle. 

Mahanti  and  Daniels  [20]  report  an  overall  efficiency  of  7.6%  on  16K  pro¬ 
cessors,  and  92%  for  8K  processors,  based  on  the  same  100  problems  from 
[18].  These  values  are  not  directly  comparable  to  our  corresponding  effi¬ 
ciencies  of  57%  and  67%,  however,  for  several  reasons.  The  first  is  that 
they  include  the  work  ratio  in  their  efficiency  measure.  In  fact,  they  achieve 
an  average  superlinear  speedup  (efficiency  greater  than  100%)  for  problems 
larger  than  20  million  node  generations,  which  requires  a  work  ratio  greater 
than  one.  The  second  difference  is  that  our  speed  is  based  on  a  serial  IDA* 
speed  of  118  nodes  per  second,  whereas  the  speedup  of  IDPS  is  based  on  a 
serial  IDA*  speed  of  59  nodes  per  second,  on  the  same  machine[21].  How¬ 
ever,  some  of  the  optimizations  in  our  IDA*  code  may  apply  to  IDPS  as  well, 
without  effecting  its  efficiency.  While  we  don’t  have  timing  data  on  IDPS 
for  16K  processors,  the  two  programs  took  roughly  the  same  amount  of  time 
(21.66  hours  for  SIDA*  vs.  22.56  hours  for  IDPS)  to  solve  all  100  problem 
instances  on  8K  processors,  despite  the  fact  that,  due  to  a  low  work  ratio, 
SIDA*  generated  over  30%  more  nodes[2l]. 
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pha  and  beta  bounds  that  are  produced  by  the  search  of  one  subtree  effect 
the  efficiency  of  searching  later  subtrees.  As  a  result,  these  bounds  must  be 
propagated  throughout  the  tree,  and  a  parallel  algorithm  will  do  more  work 
than  a  serial  algorithm. 

Another  example  in  this  class  of  algorithms  is  depth-first  branch-and- 
bound  for  combinatorial  optimization  problems,  such  as  the  traveling  sales¬ 
man  problem.  Since  SIDA*  maintains  the  internal  structure  of  the  tree,  it 
should  be  extensible  to  branch-and-bound  algorithms  as  well. 

11  Conclusions 

SIDA*  achieves  an  efficiency  of  67%,  for  a  speedup  of  over  5500  on  8K  pro¬ 
cessors,  and  57%,  for  a  speedup  of  over  9300  with  16K  processors  on  the 
Fifteen  Puzzle.  This  demonstrates  that  SIMD  machines  can  be  effectively 
used  to  search  irregular,  dynamically  generated  trees.  As  SIMD  machines 
increase  in  size  and  speed  relative  to  uniprocessors,  their  use  for  heuristic 
search  applications  should  become  increasingly  cost-effective. 

The  more  general  approach  of  alternating  phases  of  work  followed  by  load 
balancing  should  apply  to  other  irregular  computations  on  a  SIMD  machine, 
such  as  Monte-Carlo  methods[12].  We  present  a  very  general  technique  for 
deciding  when  to  halt  work  and  begin  load  balancing.  This  dynamic  load¬ 
balancing  trigger  is  simple  to  compute,  makes  locally  greedy  decisions,  and 
achieved  relatively  high  overall  utilization  with  modest  overhead  in  our  ex¬ 
periments. 
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Key: 

dn.  problem  number  ordered  by  difficulty  of  [Kor85] 

pn.  problem  number  of  [Kor85] . 

W (IDA) .  nodes  generated  by  serial  IDA*, 
w  nodes  generated  by  SIDA*. 

T  elapsed  time  of  SIDA*. 

R  raw  speed  ratio. 

F  fraction  of  time  working. 

U  utilization. 

N  work  ratio,  ratio  of  serial  to  parallel  nodes  generated. 

E/N  efficiency  without  work  ratio,  E/N  *  R*F*U. 

(totals  count  all  100  problem  instances  as  one  large  problem) 
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